MSSQL Server Not Using NonClustered Composite Key Index (PK + FK) on Inner Join

Having the following structure:
Table Auction (Id_Auction (Pk), DateTime_Auction)
Table Auction_Item (Id_Auction_Item (Pk), Id_Auction (Fk), Id_Winning_Bid (Fk), Item_Description)
Table Bid (Id_Bid (Pk), Id_Auction_Item (Fk), Id_Bidder (Fk), Lowest_Value, Highest_Value)
Table Bidder (Id_Bidder (Pk), Name)
Indexes for Auction are not relevant.
Indexes for Auction_Item:
Clustered Index PK_Auction_Item (Id_Auction_Item)
NonClustered Index IX_Auction_Item_IdWinningBid (Id_Winning_Bid)
Indexes for Bid:
Clustered Index PK_Bid (Id_Bid)
NonClustered Index IX_Bid_IdBidder (Id_Bidder)
NonClustered Index IX_Bid_IdBid_IdBidder (Id_Bid, Id_Bidder) Unique Included (Id_Auction_Item, Lowest_Value, Highest_Value)
Indexes for Bidder are not relevant.
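A rough DDL sketch of those indexes, reconstructed from the descriptions above (approximate, not the actual schema):
-- Approximate reconstruction; the real tables/indexes may differ.
CREATE NONCLUSTERED INDEX IX_Auction_Item_IdWinningBid
    ON Auction_Item (Id_Winning_Bid);
CREATE NONCLUSTERED INDEX IX_Bid_IdBidder
    ON Bid (Id_Bidder);
CREATE UNIQUE NONCLUSTERED INDEX IX_Bid_IdBid_IdBidder
    ON Bid (Id_Bid, Id_Bidder)
    INCLUDE (Id_Auction_Item, Lowest_Value, Highest_Value);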
I'll ask you to bear with me a little... This structure is only so you can recognize the relationships between the tables/data; it is not intended to follow best practices. The actual database is really much more complex (the "Bid" table has around 54 million rows). Oh, yes: each Auction_Item will have only one "Bid per Bidder", holding that bidder's highest and lowest bid.
So, when I execute the following query:
Select
Auc.Id_Auction,
Itm.Id_Auction_Item,
Itm.Item_Description,
B.Id_Bid,
B.Lowest_Value,
B.Highest_Value
From
Auction Auc
Inner Join Auction_Item Itm on Itm.Id_Auction = Auc.Id_Auction
Inner Join Bid B on B.Id_Bid = Itm.Id_Winning_Bid
And B.Id_Bidder = 27
Where Auc.DateTime_Auction > '2014-01-01';
Why does SQL Server prefer NOT to use "IX_Bid_IdBid_IdBidder", and instead uses this execution plan for Bid:
If I disable IX_Bid_IdBidder and force it to use "IX_Bid_IdBid_IdBidder", everything gets messed up:
I can't understand why MSSQL prefers to use two indexes instead of the single one that completely covers the query. My only guess is that it's faster to use the clustered index, but I can't believe that's faster than just using the unique composite key of the other nonclustered index.
Why?
Update:
As proposed by @Arvo, I changed the order of the key columns of "IX_Bid_IdBid_IdBidder", making Id_Bidder first and Id_Bid second. It then became the preferred index. So, once again: why is MSSQL using the less selective index key first instead of the most selective one? Id_Bid is explicitly referenced in the inner join...
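For clarity, the reordered index would be roughly this (a sketch; DROP_EXISTING is just one way to rebuild it in place with the new key order):
-- Sketch: same name, key order flipped so Id_Bidder leads.
CREATE UNIQUE NONCLUSTERED INDEX IX_Bid_IdBid_IdBidder
    ON Bid (Id_Bidder, Id_Bid)
    INCLUDE (Id_Auction_Item, Lowest_Value, Highest_Value)
    WITH (DROP_EXISTING = ON);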
Old update:
I updated the query, making it even more selective.
Also, I updated the index "IX_Bid_IdBid_IdBidder" to include Id_Auction_Item.
Apologies:
The index IX_Bid_IdAuctionItem_IdBidder is in fact IX_Bid_IdBid_IdBidder, which includes Id_Bid in the index's unique key!

SQL Server rarely ignores a covering, correctly sorted index. Only pathological cases come to mind, such as extremely low page fullness or huge unneeded additional columns.
Your index is simply not covering. Look at the columns that are output; you'll discover one that you have not indexed.
That column is Id_Auction_Item.

OK, I think that after a lot of research (and learning a bit more about how joins really work behind the scenes) I figured it out.
For now, I'll post it only as a theory, until some SQL master says it's wrong and shows me the light, or until I'm really sure I'm right.
The point is that MSSQL chooses what is fastest for the whole query, not only for the Bid table. So the optimizer has to choose whether to start from the Auction table or from the Bid table (because of the conditions I specified: DateTime_Auction and Id_Bidder).
In my (frivolous) mind, I thought the best execution plan would start from the Auction table:
Get the Auctions that match the specified date >> get the Auction_Items matching the inner join with Auctions >> get the Bids matching the inner join with Auction_Item AND that have Id_Bidder matching the specified id.
This would select a lot of rows at each "level"/nested loop, and only at the end use the specified index to exclude 90% of the data.
Instead, MSSQL wants to start with the smallest data set possible. In this case, that is only the Bids of the specified bidder, since there are a lot of Auction Items the bidder may simply not have participated in. Doing this, each nested loop has its outer table shrunk compared with "my plan":
Get the Bids of the specified bidder >> inner join with Auction_Item >> filter Auctions by the specified date.
If you look at the rightmost nested loop, which I presume is the first one executed, the outer table of the loop is the preselected list of the bidder's Bids obtained using the appropriate index (IX_Bid_IdBidder); it then executes a scan on the clustered index, and so on...
To make it even better, I included the columns that were in "IX_Bid_IdBid_IdBidder" in "IX_Bid_IdBidder", so MSSQL doesn't need to execute a key lookup on PK_Bid.
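Roughly, the widened index looks like this (a sketch; Id_Bid doesn't need to be listed, because the clustering key travels with every nonclustered index anyway):
-- Sketch: rebuild the existing index so it covers the query's Bid columns.
CREATE NONCLUSTERED INDEX IX_Bid_IdBidder
    ON Bid (Id_Bidder)
    INCLUDE (Id_Auction_Item, Lowest_Value, Highest_Value)
    WITH (DROP_EXISTING = ON);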
There are a lot of Auction Items for each Auction, but only one Bid from the specified Bidder for each Auction Item, so the first nested loop selects the minimum set of valid Auction Items we will need, which in turn limits the Auctions we have to consider when matching the date. Since we are starting from Bids, there is no "list" of Id_Bids to limit on, and so MSSQL cannot use the index "IX_Bid_IdBid_IdBidder" EVEN though it covers all the fields of the query. Thinking about it now, it seems a little obvious.
Anyway, thanks to everybody who helped me!
My research:
http://sqlmag.com/database-performance-tuning/advanced-join-techniques (a little outdated...)
https://technet.microsoft.com/en-us/library/ms191426%28v=sql.105%29.aspx
https://technet.microsoft.com/en-us/library/ms191318%28v=sql.105%29.aspx
http://blogs.msdn.com/b/craigfr/archive/2006/07/26/679319.aspx
http://blogs.msdn.com/b/craigfr/archive/2009/03/18/optimized-nested-loops-joins.aspx

There are a lot of people out there who know a lot more about SQL Server than I do, but this sounds a lot like one of two possible problems:
First, it could be that SQL Server is using outdated statistics to determine what's "most efficient", and because the statistics are wrong, it's picking the wrong index.
The second is a lot less likely, but bears mentioning: you haven't mentioned stored procedures in your text, but if this is in a stored proc, SQL could be using a cached (and very wrong) execution plan - look up 'parameter sniffing' for more explanation on this topic.
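If either of those turns out to be the culprit, the usual first steps look something like this (a sketch; @Id_Bidder stands in for a hypothetical stored-procedure parameter):
-- Refresh statistics so the optimizer's row estimates are current.
UPDATE STATISTICS Bid WITH FULLSCAN;
UPDATE STATISTICS Auction_Item WITH FULLSCAN;
-- Inside a stored procedure, OPTION (RECOMPILE) forces a fresh plan per execution,
-- sidestepping a cached plan sniffed from unrepresentative parameter values.
SELECT B.Id_Bid, B.Lowest_Value, B.Highest_Value
FROM Bid B
WHERE B.Id_Bidder = @Id_Bidder
OPTION (RECOMPILE);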

Related

Simple Inner join suggesting an Include index

I have this simple inner join query, and in its execution plan the master table has around 34K records and the detail table has around 51K records. But for this simple query SQL Server is suggesting I add an index with INCLUDE (containing all the master columns that I included in the select). I wasn't expecting this; what could be the reason, and what is the remedy?
DECLARE
@StartDrInvDate Date = '2017-06-01',
@EndDrInvDate Date = '2017-08-31'
SELECT
Mastertbl.DrInvoiceID,
Mastertbl.DrInvoiceNo,
Mastertbl.DistributorInvNo,
PreparedBy,
detailtbl.BatchNo, detailtbl.Discount,
detailtbl.TradePrice, detailtbl.IssuedUnits,
detailtbl.FreeUnits
FROM
scmDrInvoices Mastertbl
INNER JOIN
scmDrInvoiceDetails detailtbl ON Mastertbl.DrInvoiceID = detailtbl.DrInvoiceID
WHERE
(Mastertbl.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate)
My real curiosity is why it is suggesting this index - I normally do not see this behavior with larger tables.
For this query:
SELECT m.DrInvoiceID, m.DrInvoiceNo, m.DistributorInvNo,
PreparedBy,
d.BatchNo, d.Discount, d.TradePrice, d.IssuedUnits, d.FreeUnits
FROM scmDrInvoices m INNER JOIN
scmDrInvoiceDetails d
ON m.DrInvoiceID = d.DrInvoiceID
WHERE m.DrInvDate BETWEEN @StartDrInvDate AND @EndDrInvDate;
I would expect the basic indexes to be: scmDrInvoices(DrInvDate, DrInvoiceID) and scmDrInvoiceDetails(DrInvoiceID). These indexes would allow the query engine to quickly identify the rows that match the WHERE clause in the master table and then look up the corresponding values in scmDrInvoiceDetails.
The rest of the columns could then be included in either index so the indexes would cover the query. "Cover" means that all the columns are in the index, so the query plan does not need to refer to the original data pages.
The above strategy is what SQL Server is suggesting.
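In DDL terms, that suggestion might look roughly like this (index names are mine, and PreparedBy is assumed to be a column of the master table):
CREATE NONCLUSTERED INDEX IX_scmDrInvoices_DrInvDate
    ON scmDrInvoices (DrInvDate, DrInvoiceID)
    INCLUDE (DrInvoiceNo, DistributorInvNo, PreparedBy);
CREATE NONCLUSTERED INDEX IX_scmDrInvoiceDetails_DrInvoiceID
    ON scmDrInvoiceDetails (DrInvoiceID)
    INCLUDE (BatchNo, Discount, TradePrice, IssuedUnits, FreeUnits);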
You can perhaps see the logic of why it's suggesting to index the invoice date; it's done some calculation on the number of rows you want out of the number of rows it thinks there are currently, and it appears that the selectivity of an index on that column makes it worth indexing. If you want 3 rows out of 55,000, and you want it every 5 minutes forever, it makes sense to index. Especially if the growth rate of that table means that next year it'll be 3 rows out of 5.5 million.
The INCLUDE recommendation is perhaps more naively recommending that you associate enough additional data with the indexed values that the entire dataset demanded from the master table can be answered from the index, without hitting the table. Indexes are essentially pointers to rows in a table; when the query engine has used the index to locate all the rows it will need, it still needs to hit the table to actually get the data you want. By including data in an index you remove the need to go to the table. That's sensible sometimes, but not others (creating many indexes that essentially replicate most or all of a table's data for seldom-run queries is a waste of disk space).
Consider too that the frequency with which you're running this query now, in a debug tool, affects SQL Server's opinion of how often the query is used. I routinely find my SQL Azure portal making index recommendations thanks to the devs running a query over and over while debugging it, when I actually know that in prod that query will be used once a month. So I discard the recommendation to make an index that includes most of the table, when a straight "index only the columns searched" will do fine, no INCLUDE necessary.
These recommendations thus shouldn't be blindly heeded, as SQL Server cannot know what you intend to use this query, or similar queries, for in your real-world applications. Index creation and maintenance should be done carefully and thoughtfully. For example, it may be that this query asks for this index and another query would want an index on a different column, but it might make sense to create one index that keys on both columns (in a particular order) and then, in whichever query searches on the column that is indexed second, include a predicate that hits the first indexed column regardless of whether the query needs it.
Example: in your invoices table you have a column indicating whether an invoice is paid or not, and somewhere else in your app you have another query that counts the number of unpaid invoices. You can either have two indexes - one on invoice date (for this query) and one on status (for that query) - or one index on both columns (status, date), and in this query have predicates of WHERE status = 'unpaid' AND date BETWEEN ... even though the status predicate is redundant. Why might it be redundant? Suppose you know you'll only ever be choosing invoices from last week that have not been sent out yet, so they can only ever be unpaid. This is what I mean by "be thoughtful about indexing" - you know lots about your app that SQL Server can never figure out. By including the redundant status column in the "get invoices from last week" query, you allow the query engine to use an index that is ordered first by status, then by date. This means you only have to maintain one index, and it can be used by two queries.
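A rough sketch of that idea (the table and column names here are invented for illustration):
CREATE NONCLUSTERED INDEX IX_Invoices_Status_Date
    ON Invoices (Status, InvoiceDate);
-- The status predicate is logically redundant for "last week's unsent invoices",
-- but it lets this query seek on the same (Status, InvoiceDate) index that the
-- unpaid-invoice count uses.
SELECT InvoiceID
FROM Invoices
WHERE Status = 'unpaid'
  AND InvoiceDate BETWEEN '2017-08-24' AND '2017-08-31';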
Index maintenance and logic of creation can be a full time job.. ;)

Why can't I simply add an index that includes all columns?

I have a table in a SQL Server database which I want to be able to search and retrieve data from as fast as possible. I don't care how long it takes to insert into the table; I am only interested in the speed at which I can get data.
The problem is the table is accessed with 20 or more different types of queries. This makes it a tedious task to add an index specially designed for each query. I'm considering instead simply adding an index that includes ALL columns of the table. It's not something you would normally do in "good" database design, so I'm assuming there is some good reason why I shouldn't do it.
Can anyone tell me why I shouldn't do this?
UPDATE: I forgot to mention, I also don't care about the size of my database. It's OK if that means my database will grow larger than it needs to.
First of all, an index in SQL Server can have at most 900 bytes in its index key. That alone makes it impossible to have an index keyed on all columns.
Most of all: such an index makes no sense at all. What are you trying to achieve??
Consider this: if you have an index on (LastName, FirstName, Street, City), that index will not be able to be used to speed up queries on
FirstName alone
City
Street
That index would be useful for searches on
(LastName), or
(LastName, FirstName), or
(LastName, FirstName, Street), or
(LastName, FirstName, Street, City)
but really nothing else - certainly not if you search for just Street or just City!
The order of the columns in your index makes quite a difference, and the query optimizer can't just use any column somewhere in the middle of an index for lookups.
Consider your phone book: it's ordered by LastName, then FirstName, maybe Street. So does that indexing help you find all the "Joes" in your city? All the people living on "Main Street"? No - you have to look up by LastName first; only then do you get more specific inside that set of data. Just having an index over everything doesn't help speed up searching on every column at all.
If you want to be able to search by Street - you need to add a separate index on (Street) (and possibly another column or two that make sense).
If you want to be able to search by Occupation or whatever else - you need another specific index for that.
Just because your column exists in an index doesn't mean that'll speed up all searches for that column!
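A small sketch of that leftmost-prefix rule (the table name here is made up; the columns are from the example above):
CREATE NONCLUSTERED INDEX IX_People_LastName_FirstName_Street_City
    ON People (LastName, FirstName, Street, City);
-- Can seek: (LastName, FirstName) is a leading prefix of the index key.
SELECT * FROM People WHERE LastName = 'Smith' AND FirstName = 'Joe';
-- Cannot seek on this index: Street is not a leading prefix, so the engine
-- falls back to a scan (or to a separate index on Street, if one exists).
SELECT * FROM People WHERE Street = 'Main Street';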
The main rule is: use as few indices as possible - too many indices can be even worse for a system than having no indices at all.... build your system, monitor its performance, and find those queries that cost the most - then optimize these, e.g. by adding indices.
Don't just blindly index every column just because you can - this is a guarantee for lousy system performance - any index also requires maintenance and upkeep, so the more indices you have, the more your INSERT, UPDATE and DELETE operations will suffer (get slower) since all those indices need to be updated.
You have a fundamental misunderstanding of how indexes work.
Read this explanation "how multi-column indexes work".
The next question you might have is why not creating one index per column--but that's also a dead-end if you try to reach top select performance.
You might feel that it is a tedious task, but I would say it's a required task to index carefully. Sloppy indexing strikes back, as in this example.
Note: I am strongly convinced that proper indexing pays off, and I know that many people have the very same questions you have. That's why I'm writing a free book about it. The links above refer to the pages that might help you answer your question. However, you might also want to read it from the beginning.
...if you add an index that contains all columns, and a query was actually able to use that index, it would scan it in the order of the primary key, which means hitting nearly every record. Average search time would be O(n/2) - the same as hitting the actual table.
You need to read a lot more about indexes.
It might help if you consider an index on a table to be a bit like a Dictionary in C#.
var nameIndex = new Dictionary<String, List<int>>();
That means that the name column is indexed, and will return a list of primary keys.
var nameOccupationIndex = new Dictionary<String, List<Dictionary<String, List<int>>>>();
That means that the name column + occupation column are indexed. Now imagine an index that contained 10 different columns, nested so deep that it contains every single row in your table.
This isn't exactly how it works mind you. But it should give you an idea of how indexes could work if implemented in C#. What you need to do is create indexes based on one or two keys that are queried on extensively, so that the index is more useful than scanning the entire table.
If this is a data warehouse type operation where queries are highly optimized for READ queries, and if you have 20 ways of dissecting the data, e.g.
WHERE clause involves..
Q1: status, type, customer
Q2: price, customer, band
Q3: sale_month, band, type, status
Q4: customer
etc
And if you absolutely have plenty of fast storage space to burn, then by all means create an index for EVERY single column, separately. So a 20-column table will have 20 indexes, one for each individual column. I could probably say to ignore bit columns or low-cardinality columns, but since we're going this far, why bother with that admonition. They will just sit there and churn the WRITE time, but if you don't care about that part of the picture, then we're all good.
Analyze your 20 queries, and if you have hot queries (the hottest ones) that still won't go any faster, get the plan in SSMS (press Ctrl+L) with one query in the query window. It will tell you what index can help that query - just create it; create them all, fully remembering that this adds again to the write cost, backup file size, db maintenance time, etc.
I think the questioner is asking
'why can't I make an index like':
create index index_name
on table_name
(
*
)
The problems with that have been addressed.
But given that it sounds like they are using MS SQL Server,
it's useful to understand that you can include non-key columns in an index, so that the values of those columns are available for retrieval from the index but are not used as selection criteria:
create index index_name
on table_name
(
foreign_key
)
include (a,b,c,d) -- every column except foreign key
I created two tables with a million identical rows
I indexed table A like this
create nonclustered index index_name_A
on A
(
foreign_key -- this is a guid
)
and table B like this
create nonclustered index index_name_B
on B
(
foreign_key -- this is a guid
)
include (id,a,b,c,d) -- ( every key except foreign key)
No surprise, table A was slightly faster to insert into.
But when I ran these queries:
select * from A where foreign_key = @guid
select * from B where foreign_key = @guid
On table A, SQL Server didn't even use the index; it did a table scan and complained about a missing index including id, a, b, c, d.
On table B, the query was over 50 times faster with much less IO.
Forcing the query on A to use the index didn't make it any faster:
select * from A where foreign_key = @guid
select * from A with (index(index_name_A)) where foreign_key = @guid
I'm considering instead simply adding an index that includes ALL columns of the table.
This is always a bad idea. Indexes in a database are not some sort of pixie dust that works magically. You have to analyze your queries and, according to what is being queried and how, add indexes.
It is not as simple as "add everything to an index and have a nap".
I see only long and complicated answers here so I thought I should give the simplest answer possible.
You cannot add an entire table, or all its columns, to an index because that just duplicates the table.
In simple terms, an index is just another table with selected data ordered in the order you normally expect to query it in, and a pointer to the row on disk where the rest of the data lives.
So a level of indirection exists. You have a partial copy of the table, in a preordered manner (both on disk and in RAM, assuming the index is not fragmented), which is faster to query for the columns defined in the index only, while the rest of the columns can be fetched without having to scan the disk for them, because the index contains a reference to the correct position on disk where the rest of the data for each row lives.
1) Size: an index essentially builds a copy of the data in that column in some easily searchable structure, like a binary tree (I don't know the SQL Server specifics).
2) You mentioned speed: index structures are slower to add to.
That index would just be identical to your table (possibly sorted in another order).
It won't speed up your queries.

Table index design

I would like to add index(s) to my table.
I am looking for general ideas how to add more indexes to a table.
Other than the PK clustered.
I would like to know what to look for when I am doing this.
So, my example:
This table (let's call it the TASK table) is going to be the biggest table of the whole application, expecting millions of records.
IMPORTANT: massive bulk-insert is adding data in this table
table has 27 columns: (so far, and counting :D )
int x 9 columns = id-s
varchar x 10 columns
bit x 2 columns
datetime x 5 columns
INT COLUMNS
all of these are INT IDs, but from tables that are usually much smaller than the Task table (10-50 records max), for example a Status table (with values like "open", "closed") or a Priority table (with values like "important", "not so important", "normal")
there is also a column like "parent ID" (a self-referencing ID)
join: all the "small" tables have a PK, the usual way... clustered
STRING COLUMNS
there is a Company column (a string!) that is "always 5 characters long", and every user will be restricted by it. If the Task table contains 15 different "Companies", the logged-in user would only see one. So there's always a filter on this column. Might it be a good idea to add an index to it?
DATE COLUMNS
I think you don't index these... right? Or can/should they be indexed?
I wouldn't add any indices - unless you have specific reasons to do so, e.g. performance issues.
In order to figure out what kind of indices to add, you need to know:
what kind of queries are being used against your table - what are the WHERE clauses, what kind of ORDER BY are you doing?
how is your data distributed? Which columns are selective enough (< 2% of the data) to be useful for indexing
what kind of (negative) impact do additional indices have on your INSERTs and UPDATEs on the table
any foreign key columns should be part of an index - preferably as the first column of the index - to speed up JOINs to other tables
And sure you can index a DATETIME column - what made you think you cannot?? If you have a lot of queries that will restrict their result set by means of a date range, it can make total sense to index a DATETIME column - maybe not by itself, but in a compound index together with other elements of your table.
What you cannot index are columns that hold more than 900 bytes of data - anything like VARCHAR(1000) or such.
For great in-depth and very knowledgeable background on indexing, consult the blog by Kimberly Tripp, Queen of Indexing.
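Tying those points together: if most queries filter by Company plus a date range, a compound index along these lines could serve both predicates (a sketch; the question doesn't name its datetime columns, so CreatedDate is a stand-in):
CREATE NONCLUSTERED INDEX IX_Task_Company_CreatedDate
    ON Task (Company, CreatedDate);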
in general an index will speed up a JOIN, a sort operation and a filter
So if the columns are in the JOIN, the ORDER BY or the WHERE clause, then an index will help in terms of performance... but there is always a but... with every index that you add, UPDATE, DELETE and INSERT operations will be slowed down, because the indexes have to be maintained.
so the answer is...it depends
I would say start hitting the table with queries and look at the execution plans for scans; try to turn those into seeks by either writing SARGable queries or adding indexes if needed... don't just add indexes for the sake of adding indexes.
Step one is to understand how the data in the table will be used: how will it be inserted, selected, updated, deleted. Without knowing your usage patterns, you're shooting in the dark. (Note also that whatever you come up with now, you may be wrong. Be sure to compare your decisions with actual usage patterns once you're up and running.) Some ideas:
If users will often be looking up individual items in the table, an index on the primary key is critical.
If data will be inserted with great frequency and you have multiple indexes, over time you will have to deal with index fragmentation. Read up on and understand clustered and non-clustered indexes and fragmentation (ALTER INDEX...REBUILD).
But if performance is key in situations where you need to retrieve a lot of rows, you might consider using your clustered index to support that.
If you often want a set of data based on Status, indexing on that column can be good--particularly if 1% of your rows are "Active" vs. 99% "Not Active", and all you want are the active ones.
Conversely, if your "PriorityId" is only used to get the "label" stating what PriorityId 42 is (i.e. join into the lookup table), you probably don't need an index on it in your main table.
A last idea, if everyone will always retrieve data for only one Company at a time, then (a) you'll definitely want to index on that, and (b) you might want to consider partitioning the table on that value, as it can act as a "built in filter" above and beyond conventional indexing. (This is perhaps a bit extreme and it's only available in Enterprise edition, but it may be worth it in your case.)

Making a more efficient join

Here's my query, it is fairly straightforward:
SELECT
INVOICE_ITEMS.II_IVNUM, INVOICE_ITEMS.IIQSHP
FROM
INVOICE_ITEMS
LEFT JOIN
INVOICES
ON
INVOICES.INNUM = INVOICE_ITEMS.II_INNUM
WHERE
INVOICES.IN_DATE
BETWEEN
'2010-08-29' AND '2010-08-30'
;
I have very limited knowledge of SQL, but I'm trying to understand some of the concepts like subqueries and the like. I'm not looking for a redesign of this code, but rather an explanation of why it is so slow (600+ seconds on my test database) and how I can make it faster.
From my understanding, the left join is creating a virtual table and populating it with every result row from the join, meaning that it is processing every row. How would I stop the query from reading the table completely and instead have it apply the WHERE/BETWEEN clause first, then create the virtual table after that (if that is possible)?
How is my logic? Are there any consistently recommended resources to get me to SQL ninja status?
Edit: Thanks everyone for the quick and polite responses. Currently, I'm connecting over ODBC to a proprietary database that is used in the rapid application development framework called OMNIS. Therefore, I really have no idea what sort of optimization is being run, but I believe it is based loosely on MSSQL.
I would rewrite it like this, and make sure you have indexes on INVOICES.INNUM, INVOICE_ITEMS.II_INNUM, and INVOICES.IN_DATE. The LEFT JOIN is being turned into an INNER JOIN by your WHERE clause, so I rewrote it as such:
SELECT ii.II_IVNUM, ii.IIQSHP
FROM INVOICE_ITEMS ii
INNER JOIN INVOICES i ON i.INNUM = ii.II_INNUM
WHERE i.IN_DATE BETWEEN '2010-08-29' AND '2010-08-30'
Depending on what database you are using, what may be happening is all of the records from INVOICE_ITEMS are being joined (due to the LEFT JOIN), regardless of whether there is a match with INVOICE or not, and then the WHERE clause is filtering down to the ones that matched that had a date within range. By switching to an INNER JOIN, you may make the query more efficient, by only needing to apply the WHERE clause to INVOICES records that have a matching INVOICE_ITEMS record.
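If those indexes don't already exist, they might look roughly like this (index names are mine; this assumes the backend accepts standard CREATE INDEX syntax, which the question isn't certain about):
CREATE INDEX IX_INVOICE_ITEMS_II_INNUM ON INVOICE_ITEMS (II_INNUM);
CREATE INDEX IX_INVOICES_IN_DATE ON INVOICES (IN_DATE);
-- INVOICES.INNUM should already be covered by its primary key index.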
Since that is a very basic query, the optimizer should do fine with it; most likely your problem is incorrect indexing. Do you have indexes on the IN_DATE field and the INVOICE_ITEMS.II_INNUM field? If you have properly set up PK/FK relationships, INVOICES.INNUM should already be indexed, but FKs are not indexed automatically.
Your query is fine, it's the indexes you have to look at.
Are INVOICES.INNUM and INVOICE_ITEMS.II_INNUM indexed?
If not SQL has to do something called a 'scan' - it searches every single record.
You can think of indexes as like the tabs on the side of a phone book - you know where to start looking for people based on the first letters of their surname. Without an index (say you want to look for names that end in '...son') you have to search the entire book.
There are different types of index - they can be ordered (like the phone book index - all ordered by surname) or not (like the index at the back of a book - there's an overhead in finding the index and then the actual page).
You should also be able to view the query plan - this is how the server executes the SQL statement. That can tell you all sorts of more advanced stuff - for instance there are multiple ways to do the job: a merge join is possible if both tables are sorted by the join field or a nested join will loop through the smaller table for every record in the larger table.
Well, there is no reason why this query should be slow... the only thing that comes to mind is: do you have indexes on INVOICES.INNUM and INVOICE_ITEMS.II_INNUM? If you add them it could speed up the select, but it would slow down updates/inserts...
A join doesn't create a "virtual table" on anything more than just a conceptual level.
The performance issue with your query most likely lies in poor or insufficient indexing. You should have indexes on:
INVOICE_ITEMS.II_INNUM
INVOICES.IN_DATE
You should also have an index on INVOICES.INNUM, but if that's the primary key of the table then it already has one.
Also, don't use a left join here. If there's a foreign key between INVOICE_ITEMS.II_INNUM and INVOICES.INNUM (and INVOICE_ITEMS.II_INNUM is not nullable), then you'll never encounter a record in INVOICE_ITEMS that won't match up to a record in INVOICES. Even if there were, your WHERE condition is using a value from INVOICES, so you'd eliminate any unmatched rows anyway. Just use a regular JOIN.

faster way to use sets in MySQL

I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..
There are roughly 7 million rows in the table. Currently, the table is being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
in the actual query, currently over 560 user_ids are in the IN clause. With several million records in the table, this query is slow!
There are secondary indexes on table, the first of which being on user_id itself, which I thought would help.
I know that SELECT * is A Bad Thing, and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles. There are another 50 of those being returned, but they are needed for the report.
I imagine there's a much better way to access the data for the user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?
I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.
EDIT
Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
does that help?
First, check if there is an index on USER_ID and make sure it's used.
You can do it with running EXPLAIN.
Second, create a temporary table and use it in a JOIN:
CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL, PRIMARY KEY (user_id));
-- populate it with the ~560 ids from the IN list, e.g.:
INSERT INTO temptable (user_id) VALUES (32343), (45676), (12345), (98765), (66010);
SELECT c.*
FROM temptable t
JOIN customers c
ON c.user_id = t.user_id;
Third, how many rows does your query return?
If it returns almost all rows, then it just will be slow, since it will have to pump all these millions over the connection channel, to begin with.
NULL will not slow your query down, since the IN condition only satisfies non-NULL values which are indexed.
Update:
The index is used, the plan is fine except that it returns more than half a million rows.
Do you really need to put all these 638,000 rows into the report?
Hope it's not printed: bad for rainforests, global warming and stuff.
Speaking seriously, you seem to need either aggregation or pagination on your query.
"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).
First things first - does your database fit in RAM? If not, get more RAM. No, seriously. Now, suppose your database is too huge to reasonably fit into RAM (say, > 32 GB); you should try to reduce the number of random I/Os, as they are probably what's holding things up.
I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10, etc.) and at least two spindles. If you're not, go away and get that.
You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
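A rough sketch of those two options, using the columns listed in the question (whether either is appropriate depends on the rest of the schema):
-- Option 1 (clustering): make user_id the leading part of the primary key.
-- PRIMARY KEY columns cannot be NULL, so user_id must become NOT NULL first.
ALTER TABLE customers MODIFY user_id INT NOT NULL;
ALTER TABLE customers DROP PRIMARY KEY, ADD PRIMARY KEY (user_id, record_id);
-- Option 2 (covering index): only helps if the query names these columns
-- instead of SELECT *.
CREATE INDEX idx_customers_user_cover ON customers (user_id, postcode, region);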
As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.
BUT my biggest tips are:
Have a goal in mind, work out what it is, and when you reach it, stop.
Don't take anybody's word for it - try it and see
Ensure that your performance test system is the same hardware spec as production
Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
Use synthetic data if it is not possible to use production data (Copying production data may be logistically difficult (Remember your database is >32Gb) ; it may also violate security policies).
If your query is optimal (as it probably already is), try tuning the schema, then the database itself.
Is this your most important query? Is this a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still must make random disk reads to retrieve the columns (key lookups), even after finding the records that match (index seek on the user_Id index).
If you cannot change the clustered index, then you might want to consider an ETL process (simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.
Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.
Are they the same ~560 ids every time? Or a different ~560 ids on different runs of the query?
You could just insert your 560 user IDs into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.
You can try inserting the ids you need to query on into a temp table and inner joining both tables. I don't know if that would help.