I have a TSQL select query that is running "slow"
SELECT
CustomerKey
,ProductKey
,RepresentativeKey
,ReportingDateKey
,SUM(i.InvoiceQuantity) AS InvoiceQuantity
,SUM(i.InvoiceQuantityKg) AS InvoiceQuantityKg
,SUM(i.BrutoInvoiceLineAmount) AS BrutoInvoiceLineAmount
,SUM(i.EndOfYearDiscount) AS EndOfYearDiscount
,SUM(i.NettoInvoiceLineAmount) AS NettoInvoiceLineAmount
,SUM(i.TotalLineCostPrice) AS CostPrice
,SUM(i.MarginAmount) AS MarginAmount
FROM FactInvoices i
WHERE
i.DossierKey = 2
AND i.ReportingDate BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY
CustomerKey
,ProductKey
,RepresentativeKey
,ReportingDateKey
I'm running the query in SSMS 32bit.
Execution time is 17-21s. I have tested adding non-clustered indexes on DossierKey and ReportingDate, but this only slows down the query.
The table has about 6.04M records, and this query returns about 1M rows.
It's running on SQL Server 2016 Developer Edition.
Server specs: 8 cores, 16 GB RAM and HDD storage => virtual server.
Looking at the execution plan, I can't find any improvements.
How do I speed this up? More hardware? I don't think that will help, because the server is not fully utilized while running this query.
Edit:
Execution Plan:
Index:
CREATE NONCLUSTERED INDEX [_dx1]
ON [dbo].[FactInvoices] ([DossierKey],[ReportingDate])
INCLUDE ([CustomerKey],[ProductKey],[ReportingDateKey],[RepresentativeKey],[InvoiceQuantity],[InvoiceQuantityKg],[BrutoInvoiceLineAmount],[NettoInvoiceLineAmount],[MarginAmount],[EndOfYearDiscount],[TotalLineCostPrice])
Thanks.
For this query:
SELECT CustomerKey, ProductKey, RepresentativeKey, ReportingDateKey,
. . .
FROM FactInvoices i
WHERE i.DossierKey = 2 AND
i.ReportingDate BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY CustomerKey, ProductKey, RepresentativeKey, ReportingDateKey;
I would recommend an index on FactInvoices(DossierKey, ReportingDate, CustomerKey, ProductKey, RepresentativeKey). The first two are the primary elements of the index used for the WHERE clause. The remaining three columns may be useful for the aggregation. You could also include all the additional columns used in the query.
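A minimal sketch of that recommendation (the index name is made up; the INCLUDE list simply covers the remaining columns from the posted query):
CREATE NONCLUSTERED INDEX IX_FactInvoices_Dossier_ReportingDate
ON dbo.FactInvoices (DossierKey, ReportingDate, CustomerKey, ProductKey, RepresentativeKey)
INCLUDE (ReportingDateKey, InvoiceQuantity, InvoiceQuantityKg, BrutoInvoiceLineAmount,
         EndOfYearDiscount, NettoInvoiceLineAmount, TotalLineCostPrice, MarginAmount);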
This is an article I wrote on speeding up a query.
If your query is slow, you can check the execution plan for possible areas of speed up.
Well, I have done that and find that it does not always help. The same execution plan can take seconds to run or go off into never never land and be killed after 7 minutes.
I solved this recently using a variety of techniques that I haven't seen mentioned in one place before and wanted to help anyone else in the same situation. The solution usually returned within 2 seconds.
Here is what I did.
Starting Query
This is a fairly basic query. It reports sales orders and allows the user to specify up to 6 optional where criteria.
• If the user does not enter a criteria for a value, for example Country, its criteria string is set to '' and Country is not checked.
• If the user does enter a criteria for a value, its criteria string is bracketed by '%..%'. For example, if the user enters 'Tin', strCountry is set to '%Tin%' and all countries with 'Tin' in its name are selected. (Argentina and Martinique for example.)
SELECT Top 1000
SalesHeader.InvoiceNumber
,SalesHeader.CompanyName
,SalesHeader.Street
,SalesHeader.City
,SalesHeader.Region
,SalesHeader.Country
,SalesHeader.SalesDate
,SalesHeader.InvoiceTotal
,SalesLineItem.LineItemNbr
,SalesLineItem.PartNumber
,SalesLineItem.Quantity
,SalesLineItem.UnitPrice
,SalesLineItem.Quantity * SalesLineItem.UnitPrice as ExtPrice
,PartMaster.UnitWeight
,SalesLineItem.Quantity * PartMaster.UnitWeight as ExtWeight
FROM dbo.SalesHeader
left join dbo.SalesLineItem on SalesHeader.InvoiceNumber = SalesLineItem.InvoiceNumber
left join dbo.PartMaster on SalesLineItem.PartNumber = PartMaster.PartNumber
where
(@strCountry = '' or Country like @strCountry)
and
(@strCompanyName = '' or CompanyName like @strCompanyName)
and
(@strPartNumber = '' or SalesLineItem.PartNumber like @strPartNumber)
and
(@strInvoiceNumber = '' or SalesHeader.InvoiceNumber like @strInvoiceNumber)
and
(@strRegion = '' or Region like @strRegion)
and
(@mnyExtPrice = 0 or (SalesLineItem.Quantity * SalesLineItem.UnitPrice) > @mnyExtPrice)
Order By
InvoiceNumber,
Region,
ExtPrice
I am taking this from a data warehouse I worked on. There were 260,000 records in the full query. We limited the return to 1,000 records as a user would never want more than that.
Sometimes the query would take 10 seconds or less and sometimes we would have to kill it after over 7 minutes had gone by. A user is not going to wait 7 minutes.
What We Came Up With
There are different techniques to speed up a query. The following is our resulting query. I go over each of the techniques used below.
This new query generally returned results in 2 seconds or less.
SELECT
InvoiceNumber
,Company
,Street
,City
,Region
,Country
,SalesDate
,InvoiceTotal
,LineItemNbr
,PartNumber
,Quantity
,UnitPrice
,ExtPrice
,UnitWeight
,ExtWeight
FROM
(
SELECT top 1000
IdentityID,
ROW_NUMBER() OVER (ORDER BY [SalesDate], [Country], [Company], [PartNumber]) as RowNbr
FROM dbo.SalesCombined with(index(NCI_InvcNbr))
where
(@strCountry = '' or Country like @strCountry)
and
(@strCompany = '' or Company like @strCompany)
and
(@strPartNumber = '' or PartNumber like @strPartNumber)
and
(@strInvoiceNumber = '' or InvoiceNumber like @strInvoiceNumber)
and
(@strRegion = '' or Region like @strRegion)
and
(@mnyExtPrice = 0 or ExtPrice > @mnyExtPrice)
) SubSelect
Inner Join dbo.SalesCombined on SubSelect.IdentityID = SalesCombined.IdentityID
Order By
RowNbr
Technique 1 - Denormalize the data.
I was fortunate in two ways:
• The data was small enough to create a second copy of it.
• The data did not change very often. This meant I could structure the second copy optimized for querying and allow updating to take a while.
The SalesHeader, SalesLineItem and PartMaster tables were merged into the single SalesCombined table.
The calculated values were stored in the SalesCombined table as well.
Note that I left the original tables in place. All code to update those tables was still valid. I had to create additional code to then propagate the changes to the SalesCombined table.
Technique 2 - Created An Integer Identity Value
The first field of this denormalized table is an integer identity value. This was called IdentityID.
Even if we had not denormalized the data, an integer identity value in SalesHeader could have been used for the join between it and SalesLineItem and would have sped up the original query a bit.
Technique 3 - Created A Clustered Index On This Integer Identity Value
I created a clustered index on this IdentityID value. This is the fastest way to find a record.
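A hedged sketch of Techniques 2 and 3 together (assuming SalesCombined did not already have such a column; the names are illustrative):
-- Add an integer identity column and make it the clustered index
ALTER TABLE dbo.SalesCombined ADD IdentityID int IDENTITY(1,1) NOT NULL;
CREATE UNIQUE CLUSTERED INDEX CIX_SalesCombined_IdentityID
    ON dbo.SalesCombined (IdentityID);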
Technique 4 - Created A Unique, Non-Clustered Index On The Sort Fields
The query's output is sorted on four fields: SalesDate, Country, Company and PartNumber. So I created an index on those four fields.
Then I added the IdentityID to this index. This index was noted as Unique. This allowed SQL Server to go from the sort fields to the address, essentially, of the actual record as quickly as possible.
Technique 5: Include In the Non-Clustered Index All 'Where Clause' Fields
A SQL Server index can include fields that are not part of the sort. (Who thought of this? It's a great idea.) If you include all where clause fields in the index, SQL Server does not have to look up the actual record to obtain this data.
This is the normal look up process:
1) Read the index from disk.
2) Go to the first entry on the index.
3) Find the address of the first record from that entry.
4) Read that record from disk.
5) Find any fields that are part of the where clause and apply the criteria.
6) Decide if that record is included in the query.
If you include the where clause fields in the index:
1) Read the index from disk.
2) Go to the first entry on the index.
3) Find any fields that are part of the where clause (stored in the index) and apply the criteria.
4) Decide if that record is included in the query.
CREATE UNIQUE NONCLUSTERED INDEX [NCI_InvcNbr] ON [dbo].[SalesCombined]
(
[SalesDate] ASC,
[Country] ASC,
[CompanyName] ASC,
[PartNumber] ASC,
[IdentityID] ASC
)
INCLUDE ([InvoiceNumber],
[City],
[Region],
[ExtPrice]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
ON [PRIMARY]
The execution plan for the original query is shown in the linked screenshot (Original Query Execution Plan).
The execution plan for our final query is much simpler - to start, it just reads the index (Final Query Execution Plan).
Technique 6: Created A Sub-Query To Find The IdentityID Of Each Record To Output And Their Sort Order
I created a sub-query to find the records to output and the order in which to output them. Note the following:
• Technique 7 - It explicitly says to use the NCI_InvcNbr index that has all of the fields needed in it.
• Technique 8 - It uses the Row_Number function to generate an integer for each row that will be output. These values are generated 1, 2, 3, ... in the order given by the fields in the ORDER BY clause on that line.
Technique 9: Create An Enclosing Query With All Of The Values
This query specifies the values to print. It uses the Row_Number values to know the order in which to print. Note that the inner join is done on the IdentityID field which uses the Clustered index to find each record to print.
Techniques That Did Not Help
There were two techniques that we tried that did not speed up the query. These statements are both added to the end of a query.
• OPTION (MAXDOP 1) limits the number of processors to one. This will prevent any parallelism from being done. We tried this when we were experimenting with the query and had parallelism in the execution plan.
• OPTION (RECOMPILE) causes the execution plan to be recreated every time the query is run. This can be useful when different user selections can vary the query results.
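For reference, this is roughly how those two hints are appended to a query (illustrative only; as noted, neither helped in this case):
SELECT TOP 1000 IdentityID
FROM dbo.SalesCombined
WHERE (@strCountry = '' or Country like @strCountry)
ORDER BY SalesDate, Country, Company, PartNumber
OPTION (MAXDOP 1, RECOMPILE);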
Hope this can be of use.
If you have already indexed this query and it still performs badly, you can try partitioning your table by DossierKey,
and change
WHERE i.DossierKey = 2
to
WHERE $PARTITION.partition_function_name(i.DossierKey) = $PARTITION.partition_function_name(2)
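A hedged sketch of what that could look like; the partition function name, boundary values and scheme are assumptions, not taken from the question:
CREATE PARTITION FUNCTION pfDossierKey (int)
    AS RANGE LEFT FOR VALUES (1, 2, 3);
CREATE PARTITION SCHEME psDossierKey
    AS PARTITION pfDossierKey ALL TO ([PRIMARY]);
-- after rebuilding FactInvoices (or its clustered index) on psDossierKey:
SELECT i.CustomerKey, SUM(i.NettoInvoiceLineAmount) AS NettoInvoiceLineAmount
FROM FactInvoices i
WHERE $PARTITION.pfDossierKey(i.DossierKey) = $PARTITION.pfDossierKey(2)
  AND i.ReportingDate BETWEEN '2016-01-01' AND '2017-12-31'
GROUP BY i.CustomerKey;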
https://www.cathrinewilhelmsen.net/2015/04/12/table-partitioning-in-sql-server/
https://learn.microsoft.com/en-us/sql/t-sql/functions/partition-transact-sql
Related
I have a query running against a SQL Server database that is taking over 10 seconds to execute. The table being queried has over 14 million rows.
I want to display the Text column from a Notes table by a given ServiceUserId in date order. There could be thousands of entries so I want to limit the returned values to a manageable level.
SELECT Text
FROM
(SELECT
ROW_NUMBER() OVER (ORDER BY [DateDone]) AS RowNum, Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2') AS RowConstrainedResult
WHERE
RowNum >= 40 AND RowNum < 60
ORDER BY
RowNum
Below is the execution plan for the above query.
Nonclustered Index - nonclustered index on the ServiceUserId and DateDone columns in ascending order.
Key lookup - Primary key for the table which is the NoteId
If I run the same query a second time but with different row numbers, I get a response in milliseconds, I assume from a cached execution plan. A query run for a different ServiceUserId will take ~10 seconds though.
Any suggestions for how to speed up this query?
You should look into Keyset Pagination.
It is far more performant than Rowset Pagination.
It differs fundamentally from rowset pagination in that, instead of referencing a particular block of row numbers, you reference a starting point from which to seek into the index key.
The reason it is much faster is that you don't care about how many rows are before a particular key, you just seek a key and move forward (or backward).
Say you are filtering by a single ServiceUserId, ordering by DateDone. You need an index as follows (you could leave out the INCLUDE if it's too big, it doesn't change the maths very much):
create index IX_DateDone on Notes (ServiceUserId, DateDone) INCLUDE (TEXT);
Now, when you select some rows, instead of giving the start and end row numbers, give the starting key:
SELECT TOP (20)
Text,
DateDone
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
AND DateDone > @startingDate
ORDER BY
DateDone;
On the next run, you pass the last DateDone value you received. This gets you the next batch.
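A hedged usage sketch of that loop (variable names and column types are assumed):
-- the first call uses a sentinel date earlier than any row
DECLARE @startingDate datetime = '1900-01-01';
DECLARE @page TABLE ([Text] nvarchar(max), DateDone datetime);

INSERT INTO @page ([Text], DateDone)
SELECT TOP (20) Text, DateDone
FROM Notes
WHERE ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
  AND DateDone > @startingDate
ORDER BY DateDone;

-- remember the last key of this page and pass it back in as @startingDate for the next page
SELECT @startingDate = MAX(DateDone) FROM @page;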
The one small downside is that you cannot jump pages. However, it is much rarer than some may think (from a UI perspective) for a user to want to jump to page 327. So that doesn't really matter.
The key must be unique. If it is not unique you can't seek to exactly the next row. If you need to use an extra column to guarantee uniqueness, it gets a little more complicated:
WITH NotesFiltered AS
(
SELECT * FROM Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
)
SELECT TOP (20)
Text,
DateDone
FROM (
SELECT
Text,
DateDone,
0 AS ordering
FROM NotesFiltered
WHERE
DateDone = @startingDate AND NoteId > @startingNoteId
UNION ALL
SELECT
Text,
DateDone,
1 AS ordering
FROM NotesFiltered
WHERE
DateDone > @startingDate
) n
ORDER BY
ordering, DateDone, NoteId;
Side Note
In RDBMSs that support row-value comparisons, the multi-column example could be simplified back to the original code by writing:
WHERE (DateDone, NoteId) > (@startingDate, @startingNoteId)
Unfortunately SQL Server does not support this currently.
Please vote for the Azure Feedback request for this
I would suggest using ORDER BY with OFFSET ... FETCH:
It starts from row number x and fetches the next z rows, and both values can be parameterized.
SELECT
Text
FROM
Notes
WHERE
ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
Order by DateDone
OFFSET 40 ROWS FETCH NEXT 20 ROWS ONLY
Also make sure you have a proper index on DateDone; maybe include it in the index you already have on Notes if you have not done so yet.
You may need to include the Text column in your index:
create index IX_DateDone on Notes(DateDone) INCLUDE (Text, ServiceUserId)
However, be aware that adding such a huge column to the index will affect your insert/update efficiency, and of course it will need disk space.
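A hedged sketch of the parameterized form (@PageOffset and @PageSize are assumed parameter names):
DECLARE @PageOffset int = 40, @PageSize int = 20;

SELECT Text
FROM Notes
WHERE ServiceUserId = '6D33B91A-1C1D-4C99-998A-4A6B0CC0A6C2'
ORDER BY DateDone
OFFSET @PageOffset ROWS FETCH NEXT @PageSize ROWS ONLY;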
SQL Fiddle: http://sqlfiddle.com/#!3/23cf8
In this query, when I have an In clause on an Id, and then also select other columns, the In is evaluated first, and then the Details column and other columns are pulled in via a RID Lookup:
--In production and in SQL Fiddle, Details is grabbed via a RID Lookup after the In clause is evaluated
SELECT [Id]
,[ForeignId]
,Details
--Generate a numbering(starting at 1)
--,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where foreignId In (1,2,3,5)
With this query, the Details are being pulled in via a Table Scan.
With NumberedContacts AS
(
SELECT [Id]
,[ForeignId]
--Generate a numbering(starting at 1)
,Row_Number() Over(Partition By ForeignId Order By Id Desc) as ContactNumber --Desc because older posts should be numbered last
FROM SupportContacts
Where ForeignId In (1,2,3,5)
)
Select nc.[Id]
,nc.[ForeignId]
,sc.[Details]
From NumberedContacts nc
Inner Join SupportContacts sc on nc.Id = sc.Id
Where nc.ContactNumber <= 2 --Only grab the last 2 contacts per ForeignId
;
In SqlFiddle, the second query actually gets a RID Lookup, whereas in production with a million records it produces a Table Scan (the IN clause eliminates 99% of the rows)
Otherwise the query plan shown in SQL Fiddle is identical; the only difference is that what is a RID Lookup in SQL Fiddle for the second query is a Table Scan in production :(
I would like to understand what could cause this behavior. What kinds of things would you look at to help determine why a table scan is used here?
How can I influence it to use a RID Lookup there?
From looking at operation costs in the actual execution plan, I believe I can get the second query very close in performance to the first query if I can get it to use a RID Lookup. If I don't select the Detail column, then the performance of both queries is very close in production. It is only after adding other columns like Detail that performance degrades significantly for the second query. When I put it in SQL Fiddle and saw that the execution plan used an RID Lookup, I was surprised but slightly confused...
It doesn't have a clustered index because, in testing with different clustered indexes, there was slightly worse performance for this and other queries. That was before I began adding other columns like Details, though, and I can experiment with that more, but I would like to have an understanding of what is going on now before I start shooting in the dark with random indexes.
What if you would change your main index to include the Details column?
If you use:
CREATE NONCLUSTERED INDEX [IX_SupportContacts_ForeignIdAsc_IdDesc]
ON SupportContacts ([ForeignId] ASC, [Id] DESC)
INCLUDE (Details);
then neither a RID lookup nor a table scan would be needed, since your query could be satisfied from just the index itself....
The differences in the query plans will be dependent on the types of indexes that exist and the statistics of the data for those tables in the different environments.
The optimiser uses the statistics (histograms of data frequency, mostly) and the available indexes to decide which execution plan is going to be the quickest.
So, for example, you have noticed that the performance degrades when the 'Details' column is included. This is an almost sure sign that either the 'Details' column is not part of an index, or if it is part of an index, the data in that column is mostly unique such that the index accesses would be equivalent (or almost equivalent) to a table scan.
Often when this situation arises, the optimiser will choose a table scan over the index access, as it can take advantage of things like block reads to access the table records faster than perhaps a fragmented read of an index.
To influence the path that will be chosen by the optimiser, you would need to look at possible indexes that could be added or modified to make an index access more efficient, but this should be done with care, as it can adversely affect other queries as well as possibly degrade insert performance.
The other important activity you can do to help the optimiser is to make sure the table statistics are kept up to date and refreshed at a frequency that is appropriate to the rate of change of the frequency distribution in the table data.
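A minimal sketch of refreshing those statistics for the table in the question:
-- rebuild all statistics on the table with a full scan (heavier, but most accurate)
UPDATE STATISTICS dbo.SupportContacts WITH FULLSCAN;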
If it's true that 99% of the rows would be omitted if the query were performed using the relevant index plus RID lookup, then the likeliest problem in your production environment is that your statistics are out of date and the optimiser doesn't realise that ForeignId IN (1,2,3,5) would limit the result set to 1% of the total data.
Here's a good link for discovering more about statistics from Pinal Dave: http://blog.sqlauthority.com/2010/01/25/sql-server-find-statistics-update-date-update-statistics/
As for forcing the optimiser to follow the correct path WITHOUT updating the statistics, you could use a table hint - if you know the index that your plan should be using which contains the ID and ForeignID columns then stick that in your query as a hint and force SQL optimiser to use the index:
http://msdn.microsoft.com/en-us/library/ms187373.aspx
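As a sketch, forcing the index suggested in another answer here (substitute your actual index name):
SELECT Id, ForeignId, Details
FROM SupportContacts WITH (INDEX (IX_SupportContacts_ForeignIdAsc_IdDesc))
WHERE ForeignId IN (1, 2, 3, 5);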
FYI, if you want the best performance from your second query, use this index and avoid the headache you're experiencing altogether:
create index ix1 on SupportContacts(ForeignID, Id DESC) include (Details);
I have this table in SQL Server 2008 with a lot of data
|ID|Name|Column_1|Column_2|
|..|....|........|........|
more than 18,000 records. So I need to get the row with the lowest value of Column_1, which is a date but could be any data type (and is unsorted), so I use this statement:
SELECT TOP 1 ID, Name from table ORDER BY Column_1 ASC
But this is very, very slow, and I think I don't need to sort the whole table. My question is how to get the same result without using TOP 1 and ORDER BY.
I cannot see why 18,000 rows of information would cause too much of a slowdown, but that is obviously without seeing what data you are storing.
If you are regularly going to be using the Column_1 field, then I would suggest you place a non-clustered index on it... that will speed up your query.
You can do it by "designing" your table via Sql Server Management Studio, or directly via TSQL...
CREATE INDEX IX_myTable_Column_1 ON myTable (Column_1 ASC)
More information on MSDN about creating indexes here
Update thanks to comments by @GarethD, who helped me with this, as I wasn't actually aware of it.
As an extra part of the above TSQL statement, it will increase the speed of your queries if you include, within the index, the other columns that the query uses....
CREATE INDEX IX_myTable_Column_1 ON myTable (Column_1 ASC) INCLUDE (ID, Name)
As GarethD points out, using this SQLFiddle as proof, the execution plan is much quicker as it avoids a "RID" (or Row Identifier) lookup.
More information on MSDN about creating indexes with include columns here
Thank you @GarethD
Would this work faster? When I read this question, this was the code that came to mind:
Select top 1 ID, Name
from table
where Column_1 = (Select min(Column_1) from table)
The Product table has 700K records in it. The query:
SELECT TOP 1 ID,
Name
FROM Product
WHERE contains(Name, '"White Dress"')
ORDER BY DateMadeNew desc
takes about 1 minute to run. There is a non-clustered index on DateMadeNew and a FreeText index on Name.
If I remove TOP 1 or Order By - it takes less then 1 second to run.
Here is the link to execution plan.
http://screencast.com/t/ZDczMzg5N
Looks like FullTextMatch has over 400K executions. Why is this happening? How can it be made faster?
UPDATE 5/3/2010
Looks like cardinality is out of whack on multi word FreeText searches:
Optimizer estimates that there are 28K records matching 'White Dress', while in reality there is only 1.
http://screencast.com/t/NjM3ZjE4NjAt
If I replace 'White Dress' with 'White', estimated number is '27,951', while actual number is '28,487' which is a lot better.
It seems like the optimizer is using only the first word of the phrase being searched when estimating cardinality.
Looks like FullTextMatch has over 400K executions. Why is this happening?
Since you have an index combined with TOP 1, optimizer thinks that it will be better to traverse the index, checking each record for the entry.
How can it be made faster?
If updating the statistics does not help, try adding a hint to your query:
SELECT TOP 1 *
FROM product pt
WHERE CONTAINS(name, '"test1"')
ORDER BY
datemadenew DESC
OPTION (HASH JOIN)
This will force the engine to use a HASH JOIN algorithm to join your table and the output of the fulltext query.
The fulltext query is regarded as a remote source returning the set of values indexed by the KEY INDEX provided in the FULLTEXT INDEX definition.
Update:
If your ORM uses parametrized queries, you can create a plan guide.
Use Profiler to intercept the query that the ORM sends verbatim
Generate a correct plan in SSMS using hints and save it as XML
Use sp_create_plan_guide with an OPTION (USE PLAN) hint to force the optimizer to always use this plan.
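A hedged sketch of that last step; the plan guide name, statement text and parameter list are assumptions and must match what Profiler captured exactly (the @hints value could equally be OPTION (USE PLAN N'...') with the saved XML plan):
EXEC sp_create_plan_guide
    @name   = N'PG_Product_FullTextTop1',
    @stmt   = N'SELECT TOP 1 ID, Name FROM Product WHERE CONTAINS(Name, @p0) ORDER BY DateMadeNew DESC',
    @type   = N'SQL',
    @module_or_batch = NULL,
    @params = N'@p0 nvarchar(100)',
    @hints  = N'OPTION (HASH JOIN)';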
Edit
From http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506240
The most important thing is that the correct join type is picked for full-text query. Cardinality estimation on the FulltextMatch STVF is very important for the right plan. So the first thing to check is the FulltextMatch cardinality estimation. This is the estimated number of hits in the index for the full-text search string. For example, in the query in Figure 3 this should be close to the number of documents containing the term ‘word’. In most cases it should be very accurate, but if the estimate was off by a long way, you could generate bad plans. The estimation for single terms is normally very good, but estimating multiple terms such as phrases or AND queries is more complex, since it is not possible to know what the intersection of terms in the index will be based on the frequency of the terms in the index. If the cardinality estimation is good, a bad plan probably is caused by the query optimizer cost model. The only way to fix the plan issue is to use a query hint to force a certain kind of join or OPTIMIZE FOR.
So it simply cannot know from the information it stores whether the 2 search terms together are likely to be quite independent or commonly found together. Maybe you should have 2 separate procedures one for single word queries that you let the optimiser do its stuff on and one for multi word procedures that you force a "good enough" plan on (sys.dm_fts_index_keywords might help if you don't want a one size fits all plan).
NB: Your single word procedure would likely need the WITH RECOMPILE option looking at this bit of the article.
In SQL Server 2008 full-text search we have the ability to alter the plan that is generated based on a cardinality estimation of the search term used. If the query plan is fixed (as it is in a parameterized query inside a stored procedure), this step does not take place. Therefore, the compiled plan always serves this query, even if this plan is not ideal for a given search term.
Original Answer
Your new plan still looks pretty bad though. It looks like it is only returning 1 row from the full text query part but scanning all 770159 rows in the Product table.
How does this perform?
CREATE TABLE #tempResults
(
ID int primary key,
Name varchar(200),
DateMadeNew datetime
)
INSERT INTO #tempResults
SELECT
ID, Name, DateMadeNew
FROM Product
WHERE contains(Name, '"White Dress"')
SELECT TOP 1
*
FROM #tempResults
ORDER BY DateMadeNew desc
I can't see the linked execution plan, network police are blocking that, so this is just a guess...
if it is running fast without the TOP and ORDER BY, try doing this:
SELECT TOP 1
*
FROM (SELECT
ID, Name, DateMadeNew
FROM Product
WHERE contains(Name, '"White Dress"')
) dt
ORDER BY DateMadeNew desc
A couple of thoughts on this one:
1) Have you updated the statistics on the Product table? It would be useful to see the estimates and actual number of rows on the operations there too.
2) What version of SQL Server are you using? I had a similar issue with SQL Server 2008 that turned out to be nothing more than not having Service Pack 1 installed. Install SP1, and a FreeText query that was taking a couple of minutes (due to a huge number of actual executions against estimated) went down to taking a second.
I had the same problem earlier.
The performance depends on which unique index you choose for full text indexing.
My table has two unique columns - ID and article_number.
The query:
select top 50 id, article_number, name, ...
from ARTICLE
CONTAINS(*,'"BLACK*" AND "WHITE*"')
ORDER BY ARTICLE_NUMBER
If the full text index is connected to ID then it is slow depending on the searched words.
If the full text index is connected to ARTICLE_NUMBER UNIQUE index then it was always fast.
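A hedged sketch of re-pointing the full-text index at the ARTICLE_NUMBER unique index; the index and catalog names are assumptions:
-- drop and recreate the full-text index keyed on the ARTICLE_NUMBER unique index
DROP FULLTEXT INDEX ON dbo.ARTICLE;
CREATE FULLTEXT INDEX ON dbo.ARTICLE (Name)
    KEY INDEX UX_ARTICLE_ARTICLE_NUMBER
    ON ftCatalog
    WITH CHANGE_TRACKING AUTO;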
I have a better solution.
I. Let's first review the proposed solutions, as they may still be useful in some cases:
OPTION (HASH JOIN) - is not good, as you may get the error "Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN."
SELECT TOP 1 * FROM (ORIGINAL_SELECT) ORDER BY ... - is not good when you need to paginate the results of your ORIGINAL_SELECT
sp_create_plan_guide - is not good, as to use a plan guide you have to save the plan for a specific SQL statement; this won't work for dynamic SQL statements (e.g. generated by an ORM)
II. My solution consists of two parts
1. Self join table used for Full Text search
2. Use MS SQL HASH Join Hints MSDN Join Hints
Your SQL :
SELECT TOP 1 ID, Name FROM Product WHERE contains(Name, '"White Dress"')
ORDER BY DateMadeNew desc
Should be rewritten as :
SELECT TOP 1 p.ID, p.Name FROM Product p INNER HASH JOIN Product fts ON fts.ID = p.ID
WHERE contains(fts.Name, '"White Dress"')
ORDER BY p.DateMadeNew desc
If you are using NHibernate with or without Castle Active Records, I've described in another post how to write an interceptor that modifies your query to replace INNER JOIN with INNER HASH JOIN.
I have a table with about 20+ million records.
Structure is like:
EventId UNIQUEIDENTIFIER
SourceUserId UNIQUEIDENTIFIER
DestinationUserId UNIQUEIDENTIFIER
CreatedAt DATETIME
TypeId INT
MetaId INT
The table receives about 100k+ records each day.
I have indexes on each column except MetaId, as it is not used in 'where' clauses
The problem is when I want to pick up, e.g., the latest 100 records for a desired SourceUserId.
The query sometimes takes up to 4 minutes to execute, which is not acceptable.
Eg.
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
OR
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
ORDER BY CreatedAt DESC
I can't do partitioning etc as I am using Standard version of SQL Server and Enterprise is too expensive.
I also think that the table is quite small to be that slow.
I think the problem is with the ORDER BY clause, as the db must go through a much bigger set of data.
Any ideas how to make it quicker?
Perhaps a relational database is not a good idea for this kind of data.
Data is always being picked up ordered by CreatedAt DESC
Thank you for reading.
PabloX
You'll likely want to create a composite index for this type of query. When the query runs slowly, it is most likely choosing to scan down an index on the CreatedAt column and perform a residual filter on the SourceUserId value. What you really want to happen is to jump directly to all records for a given SourceUserId, ordered properly. To achieve this, create a composite index primarily on SourceUserId (performing an equality check) and secondarily on CreatedAt (to preserve the order within a given SourceUserId value). You may want to try adding TypeId in as well, depending on the selectivity of this column.
So, the 2 that will most likely give the best repeatable performance (try them out and compare) would be:
Index on (SourceUserId, CreatedAt)
Index on (SourceUserId, TypeId, CreatedAt)
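As DDL, those two candidates would look roughly like this (index names assumed):
CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_CreatedAt
    ON Events (SourceUserId, CreatedAt);

CREATE NONCLUSTERED INDEX IX_Events_SourceUserId_TypeId_CreatedAt
    ON Events (SourceUserId, TypeId, CreatedAt);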
As always, there are also many other considerations to take into account with determining how/what/where to index, as Remus discusses in a separate answer one big consideration is covering the query vs. keeping lookups. Additionally you'll need to consider write volumes, possible fragmentation impact (if any), singleton lookups vs. large sequential scans, etc., etc.
I have indexes on each column except
MetaId
Non-covering indexes will likely hit the 'tipping point' and the query would revert to a table scan. Just adding an index on every column because it is used in a where clause does not equate to good index design. To take your query for example, a good 100% covering index would be:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId, SrcMemberId, DstMemberId)
The following index is also useful, although it is still going to cause lookups:
INDEX ON (SourceUserId , CreatedAt) INCLUDE (TypeId)
and finally, an index without any included columns may help, but it is just as likely to be ignored (it depends on the column statistics and cardinality estimates):
INDEX ON (SourceUserId , CreatedAt)
But a separate index on SourceUserId and one on CreatedAt is basically useless for your query.
See Index Design Basics.
The fact that the table has indexes built on GUID values, indicates a possible series of problems that would affect performance:
High index fragmentation: since new GUIDs are generated randomly, the index cannot organize them in a sequential order and the nodes are spread unevenly.
High number of page splits: the size of a GUID (16 bytes) causes many page splits in the index, since there's a greater chance that a new value won't fit in the remaining space available in a page.
Slow value comparison: comparing two GUIDs is a relatively slow operation because all 16 bytes must be compared.
Here a couple of resources on how to investigate and resolve these problems:
How to Detect Index Fragmentation in SQL Server 2000 and 2005
Reorganizing and Rebuilding Indexes
How Using GUIDs in SQL Server Affect Index Performance
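A hedged sketch of checking fragmentation on the Events table from the question (the thresholds in the comment are a common rule of thumb, not a hard rule):
SELECT i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Events'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
ORDER BY ips.avg_fragmentation_in_percent DESC;
-- roughly: 5-30% fragmentation -> REORGANIZE, above 30% -> REBUILD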
I would recommend getting the data into 2 separate temp tables:
INSERT INTO #Table1
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
TypeId IN (2, 3, 4)
)
INSERT INTO #Table2
SELECT * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND
(
(TypeId = 60 AND SrcMemberId != DstMemberId)
)
Then apply a UNION of the selects, with ORDER BY and TOP. Limit the data from the get-go.
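A hedged sketch of that final step over the two temp tables:
SELECT TOP 100 *
FROM (
    SELECT * FROM #Table1
    UNION ALL
    SELECT * FROM #Table2
) AS u
ORDER BY CreatedAt DESC;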
I suggest using a UNION:
SELECT TOP 100 x.*
FROM (SELECT a.*
FROM EVENTS a
WHERE a.typeid IN (2, 3, 4)
UNION ALL
SELECT b.*
FROM EVENTS b
WHERE b.typeid = 60
AND b.srcmemberid != b.dstmemberid) x
WHERE x.sourceuserid = '15b534b17-5a5a-415a-9fc0-7565199c3461'
We've realised a minor gain by moving to a BIGINT IDENTITY key for our event table; by using that as a clustered primary key, we can cheat and use that for date ordering.
I would make sure CreatedAt is indexed properly
you could split the query in two with a UNION to avoid the OR (which can cause your index not to be used), something like
SELECT * FROM (
SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId IN (2, 3, 4)
UNION SELECT TOP 100 * FROM Events WITH (NOLOCK)
WHERE SourceUserId = '15b534b17-5a5a-415a-9fc0-7565199c3461'
AND TypeId = 60 AND SrcMemberId != DstMemberId
) AS x
ORDER BY CreatedAt DESC
Also, check that the uniqueidentifier indexes are not CLUSTERED.
If there are 100K records added each day, you should check your index fragmentation.
And rebuild or reorganize it accordingly.
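A minimal sketch of that maintenance on the Events table (REORGANIZE for light fragmentation, REBUILD for heavy):
ALTER INDEX ALL ON dbo.Events REORGANIZE;
-- or, for heavily fragmented indexes:
ALTER INDEX ALL ON dbo.Events REBUILD;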
More info :
SQLauthority