I have a products table with 2+ million rows and 44 columns.
I am attempting to query this table based on the 'NAME' field (varchar 160), which I have a fulltext index on.
Here is the query, which currently takes 71.34 seconds to execute with a three-word $keyword,
62.47 seconds with a two-word $keyword, and 0.017 seconds with a single-word $keyword.
SELECT ID,
MATCH(NAME) AGAINST ('$keyword') as Relevance,
MANUFACTURER,
ADVERTISERCATEGORY,
THIRDPARTYCATEGORY,
DESCRIPTION,
AID,
SALEPRICE,
RETAILPRICE,
PRICE,
SKU,
BUYURL,
IMAGEURL,
NAME,
PROGRAMNAME
FROM products
WHERE MATCH(NAME) AGAINST ('$keyword' IN BOOLEAN MODE)
GROUP BY NAME
HAVING Relevance > 6
ORDER BY Relevance DESC LIMIT 24
How can I optimize this query to perform better on 2+ word $keyword queries?
This may not be quite the answer you were looking for, however:
With a table that large, fulltext searches are never going to be speedy. I would suggest looking into a fulltext engine like Sphinx.
As an added benefit, you can also do the relevancy matching in Sphinx, as it will return results ranked in order of relevance. Once Sphinx returns the matching IDs, you can then just use an IN statement in your query's WHERE clause and select the other data you need.
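A minimal sketch of that second step, in Python with an in-memory SQLite table standing in for the real products table. The ID list is hypothetical (it stands for whatever the search engine returns, best match first), and fetch_in_rank_order is an illustrative helper name, not part of Sphinx or MySQL. Note that IN (...) does not preserve the ranking, so the order has to be reapplied afterwards.

```python
import sqlite3

def fetch_in_rank_order(conn, ranked_ids):
    """Fetch product rows for the given IDs, keeping the search engine's ranking."""
    if not ranked_ids:
        return []
    # One placeholder per ID keeps the query parameterized.
    placeholders = ", ".join("?" for _ in ranked_ids)
    sql = f"SELECT ID, NAME, PRICE FROM products WHERE ID IN ({placeholders})"
    rows = {row[0]: row for row in conn.execute(sql, list(ranked_ids))}
    # IN (...) returns rows in arbitrary order, so reapply the ranking here.
    return [rows[i] for i in ranked_ids if i in rows]

# Stand-in table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (ID INTEGER PRIMARY KEY, NAME TEXT, PRICE REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "red dress", 30.0), (2, "white dress", 45.0), (3, "white shirt", 20.0)],
)

ranked_ids = [3, 2]  # hypothetical: IDs as returned by the search engine
for row in fetch_in_rank_order(conn, ranked_ids):
    print(row)
```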
I would also suggest looking into Sphinx's Extended Query syntax, as this will let you match multiple words across things like proximity and word order.
Is table partitioning an option with MySQL? In SQL Server, it proves useful in situations like this.
Or perhaps replace the MATCH function with a standard WHERE ... LIKE clause. I don't know MySQL well, so I fear I can't be of more help. Sorry!
Related
I was asked to optimize a SQL query in one of the interviews I attended. The table PRODUCTS structure is like this:
PRODUCT_NAME - has around 200 unique values, repeated
STATE - has around 20 unique values, repeated
COUNTRY - has around 5 unique values, repeated
The table contains 1 million rows. I was given the below SQL statement and was asked to complete it. The SQL is to fetch all the products for a particular state.
SELECT _______
FROM PRODUCTS
WHERE STATE = 'CALIFORNIA'
My answer was below:
SELECT PRODUCT_NAME, STATE, COUNTRY
FROM PRODUCTS
WHERE STATE = 'CALIFORNIA'
The interviewer was not happy with the answer and later told me that the order of the columns in the select clause could have been used to optimize and I had failed to do it.
So does the order of the columns in the select clause make any significant difference to the efficiency of a select query? If so, how?
I cannot fathom what the interviewer is thinking or what type of database the interviewer is referring to.
Databases store data on data pages, which use a binary format and contain other information (such as null-flags and perhaps record ids and page ids and so on). Retrieving values for a record requires parsing the data page -- and this takes place regardless of the order of the columns being returned by the query.
Perhaps the confusion is with indexes. Some databases recommend ordering the columns in a multi-column index based on selectivity (i.e. the number of values). When all columns in the index are used for equality comparisons, then there might be some slight optimization. However, the ordering of the columns in indexes is usually influenced by other factors, based on the queries being optimized.
The only optimization I can readily think of is removing columns. If you know the state, there is no reason to return the state. And you probably intend for that state to be in the United States, so the country is irrelevant as well. There might be some optimization to using a constant ('California' as state), but it is hard to imagine anyone actually caring about such a nano improvement in performance on a query that reads much of a large table.
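One way to check the claim about column order is to compare query plans. The sketch below uses SQLite as a stand-in, since the interview question does not name a database, and the index on STATE is an assumption added for illustration. It only shows that this particular engine reports the same plan for both column orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PRODUCTS (PRODUCT_NAME TEXT, STATE TEXT, COUNTRY TEXT)")
# Assumed index, not mentioned in the question.
conn.execute("CREATE INDEX idx_products_state ON PRODUCTS (STATE)")

def plan(sql):
    """Return the plan detail strings SQLite reports for a query."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_a = plan("SELECT PRODUCT_NAME, STATE, COUNTRY FROM PRODUCTS WHERE STATE = 'CALIFORNIA'")
plan_b = plan("SELECT COUNTRY, STATE, PRODUCT_NAME FROM PRODUCTS WHERE STATE = 'CALIFORNIA'")

print(plan_a)
print(plan_b)
```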
I need advice on how to get the fastest results when querying a large table.
I am using SQL Server 2012. My situation is as follows:
I have 5 tables containing transaction records; each table has 35 million records.
All tables have 14 columns. The columns I need to search are GroupName, CustomerName, and NoRegistration. I also have a view that combines all 5 of these tables.
The GroupName, CustomerName, and NoRegistration values are not unique within each table.
My application has a function to search these columns.
The queries look like this:
Search by Group Name:
SELECT DISTINCT(GroupName) FROM TransactionRecords_view WHERE GroupName LIKE ''+#GroupName+'%'
Search by Name:
SELECT DISTINCT(CustomerName) AS 'CustomerName' FROM TransactionRecords_view WHERE CustomerName LIKE ''+#Name+'%'
Search by NoRegistration:
SELECT DISTINCT(NoRegistration) FROM TransactionRecords_view WHERE LOWER(NoRegistration) LIKE LOWER(#NoRegistration)+'%'
My question is: how can I achieve the fastest execution time for searching?
As things stand, every search takes 3 to 5 minutes.
My idea is to create a new table containing the distinct GroupName, CustomerName, and NoRegistration values from all 5 tables.
Would that make execution faster? Or is there a better idea?
Thank you
EDIT:
This is the query for the view "TransactionRecords_view":
CREATE VIEW TransactionRecords_view
AS
SELECT * FROM TransactionRecords_1507
UNION ALL
SELECT * FROM TransactionRecords_1506
UNION ALL
SELECT * FROM TransactionRecords_1505
UNION ALL
SELECT * FROM TransactionRecords_1504
UNION ALL
SELECT * FROM TransactionRecords_1503
Please show the SQL of TransactionRecords_view. Do you have indexes? What is the collation of the NoRegistration column? Paste the Actual Execution Plan for each query.
Ok, so you don't need to make those new tables. If you create Non-Clustered indexes based upon these fields it will (in effect) do what you're after. The index will only store data on the columns that you indicate, not the whole table. Be aware, however, that indexes are excellent to aid in SELECT statements but will negatively affect any write statements (INSERT, UPDATE etc).
Next you want to run the queries with the actual execution plan switched on. This will show you how the optimizer has decided to run each query (in the back end). Are there any particular issues here, are any of the steps taking up a lot of the overall operator cost? There are plenty of great instructional videos about execution plans on youtube, check them out if you haven't looked at exe plans before.
Did you try to check if there were missing indexes with the actual execution plan ?
Moreover, since you are using a LIKE clause on a varchar column, Full-Text Search may be useful for you:
https://msdn.microsoft.com/en-us/library/ms142571(v=sql.120).aspx
I have an SQL query similar to below:
SELECT NAME,
MY_FUNCTION(NAME) -- carries out some string manipulation
FROM TITLES
ORDER BY NAME; -- has an index.
The TITLES table has approximately 12,000 records. At the moment the query takes over 5 minutes to execute but if I remove the ORDER BY clause then it executes within a couple of seconds.
Does anyone have any suggestions on how to speed up this query?
If MY_FUNCTION is deterministic (i.e. always returns the same result for the same input value) then you could create an index on (NAME, MY_FUNCTION(NAME)) and it may help (or may not!)
In comments under the question, you say that it takes 2 seconds "to return N rows without the ORDER BY". That makes sense: without the ORDER BY you will just get the first N rows encountered, as soon as they are encountered. With the ORDER BY, the first N rows are returned only after the results have been sorted into the correct order.
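The difference can be sketched in Python with made-up data: taking the first N rows lazily touches only N rows, while sorting first has to read every row before anything is returned. This models a sort performed without help from an index. When the database can read rows in index order instead, it can behave like the lazy case.

```python
from itertools import islice

def scan(rows, counter):
    """Yield rows one at a time, counting how many were read."""
    for row in rows:
        counter["read"] += 1
        yield row

# Made-up titles, deliberately stored in reverse order.
titles = [f"title-{i:05d}" for i in range(12000, 0, -1)]
n = 10

# Without ORDER BY: stop as soon as N rows have been seen.
unordered = {"read": 0}
first_n = list(islice(scan(titles, unordered), n))

# With ORDER BY: every row must be read and sorted before the first N are known.
ordered = {"read": 0}
first_n_sorted = sorted(scan(titles, ordered))[:n]

print(unordered["read"], ordered["read"])
```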
If the query is being used in a situation where getting the first N rows fast is important (e.g. an online report with pagination) then you could try adding a FIRST_ROWS or FIRST_ROWS_n hint to the query, to try to persuade it to use the index. See Choosing an Optimizer Goal
Use the EXPLAIN statement to see where the issue is
EXPLAIN SELECT NAME, MY_FUNCTION(NAME) FROM TITLES ORDER BY NAME;
Sounds weird. What is the NAME column's type?
Have you checked for defective hardware errors? Maybe (just maybe) your query with the order by clause is using your index, and your index is located in a defective disk (it could be in a different disk from the table if they are located in different tablespaces).
The Product table has 700K records in it. The query:
SELECT TOP 1 ID,
Name
FROM Product
WHERE contains(Name, '"White Dress"')
ORDER BY DateMadeNew desc
takes about 1 minute to run. There is a non-clustered index on DateMadeNew and a FreeText index on Name.
If I remove TOP 1 or Order By, it takes less than 1 second to run.
Here is the link to execution plan.
http://screencast.com/t/ZDczMzg5N
Looks like FullTextMatch has over 400K executions. Why is this happening? How can it be made faster?
UPDATE 5/3/2010
Looks like cardinality is out of whack on multi word FreeText searches:
Optimizer estimates that there are 28K records matching 'White Dress', while in reality there is only 1.
http://screencast.com/t/NjM3ZjE4NjAt
If I replace 'White Dress' with 'White', estimated number is '27,951', while actual number is '28,487' which is a lot better.
It seems like Optimizer is using only the first word in phrase being searched for cardinality.
Looks like FullTextMatch has over 400K executions. Why is this happening?
Since you have an index combined with TOP 1, optimizer thinks that it will be better to traverse the index, checking each record for the entry.
How can it be made faster?
If updating the statistics does not help, try adding a hint to your query:
SELECT TOP 1 *
FROM product pt
WHERE CONTAINS(name, '"test1"')
ORDER BY
datemadenew DESC
OPTION (HASH JOIN)
This will force the engine to use a HASH JOIN algorithm to join your table and the output of the fulltext query.
Fulltext query is regarded as a remote source returning the set of values indexed by KEY INDEX provided in the FULLTEXT INDEX definition.
Update:
If your ORM uses parametrized queries, you can create a plan guide.
Use Profiler to intercept the query that the ORM sends verbatim
Generate a correct plan in SSMS using hints and save it as XML
Use sp_create_plan_guide with an OPTION USE PLAN to force the optimizer to always use this plan.
Edit
From http://technet.microsoft.com/en-us/library/cc721269.aspx#_Toc202506240
The most important thing is that the correct join type is picked for full-text query. Cardinality estimation on the FulltextMatch STVF is very important for the right plan. So the first thing to check is the FulltextMatch cardinality estimation. This is the estimated number of hits in the index for the full-text search string. For example, in the query in Figure 3 this should be close to the number of documents containing the term ‘word’. In most cases it should be very accurate but if the estimate was off by a long way, you could generate bad plans. The estimation for single terms is normally very good, but estimating multiple terms such as phrases or AND queries is more complex since it is not possible to know what the intersection of terms in the index will be based on the frequency of the terms in the index. If the cardinality estimation is good, a bad plan probably is caused by the query optimizer cost model. The only way to fix the plan issue is to use a query hint to force a certain kind of join or OPTIMIZE FOR.
So it simply cannot know from the information it stores whether the two search terms are likely to be independent or commonly found together. Maybe you should have two separate procedures: one for single-word queries, where you let the optimiser do its stuff, and one for multi-word queries, where you force a "good enough" plan (sys.dm_fts_index_keywords might help if you don't want a one-size-fits-all plan).
NB: Your single-word procedure would likely need the WITH RECOMPILE option, judging by this bit of the article:
In SQL Server 2008 full-text search we have the ability to alter the plan that is generated based on a cardinality estimation of the search term used. If the query plan is fixed (as it is in a parameterized query inside a stored procedure), this step does not take place. Therefore, the compiled plan always serves this query, even if this plan is not ideal for a given search term.
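The article's point that per-term frequencies do not determine the size of the intersection can be illustrated with a small Python sketch. The two corpora are made up. In both, 'white' and 'dress' each occur in exactly two documents, yet the number of documents containing both differs, so no estimate based on term frequencies alone can be right for both.

```python
def doc_frequency(docs, term):
    """Number of documents that contain the term."""
    return sum(1 for words in docs if term in words)

def both(docs, a, b):
    """Number of documents that contain both terms."""
    return sum(1 for words in docs if a in words and b in words)

# Two made-up corpora; each document is a set of words.
corpus_a = [{"white", "dress"}, {"white", "dress"}, {"red", "shirt"}, {"blue", "hat"}]
corpus_b = [{"white", "shirt"}, {"white", "hat"}, {"red", "dress"}, {"blue", "dress"}]

for name, docs in (("A", corpus_a), ("B", corpus_b)):
    print(
        name,
        doc_frequency(docs, "white"),
        doc_frequency(docs, "dress"),
        both(docs, "white", "dress"),
    )
```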
Original Answer
Your new plan still looks pretty bad though. It looks like it is only returning 1 row from the full text query part but scanning all 770159 rows in the Product table.
How does this perform?
CREATE TABLE #tempResults
(
ID int primary key,
Name varchar(200),
DateMadeNew datetime
)
INSERT INTO #tempResults
SELECT
ID, Name, DateMadeNew
FROM Product
WHERE contains(Name, '"White Dress"')
SELECT TOP 1
*
FROM #tempResults
ORDER BY DateMadeNew desc
I can't see the linked execution plan, network police are blocking that, so this is just a guess...
If it runs fast without the TOP and ORDER BY, try this:
SELECT TOP 1
*
FROM (SELECT
ID, Name, DateMadeNew
FROM Product
WHERE contains(Name, '"White Dress"')
) dt
ORDER BY DateMadeNew desc
A couple of thoughts on this one:
1) Have you updated the statistics on the Product table? It would be useful to see the estimates and actual number of rows on the operations there too.
2) What version of SQL Server are you using? I had a similar issue with SQL Server 2008 that turned out to be nothing more than not having Service Pack 1 installed. After installing SP1, a FreeText query that was taking a couple of minutes (due to a huge number of actual executions against estimated) went down to taking a second.
I had the same problem earlier.
The performance depends on which unique index you choose for full text indexing.
My table has two unique columns - ID and article_number.
The query:
select top 50 id, article_number, name, ...
from ARTICLE
WHERE CONTAINS(*,'"BLACK*" AND "WHITE*"')
ORDER BY ARTICLE_NUMBER
If the full text index is connected to ID, then the query is slow, depending on the searched words.
If the full text index is connected to the ARTICLE_NUMBER UNIQUE index, then it is always fast.
I have a better solution.
I. Let's first review the proposed solutions, as they may also be useful in some cases:
OPTION (HASH JOIN) - not good, as you may get the error "Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN."
SELECT TOP 1 * FROM (ORIGINAL_SELECT) ORDER BY ... - not good when you need to paginate the results of your ORIGINAL_SELECT
sp_create_plan_guide - not good, because to use a plan guide you have to save the plan for a specific SQL statement; this won't work for dynamic SQL statements (e.g. generated by an ORM)
II. My solution consists of two parts:
1. Self join the table used for the Full Text search
2. Use the MS SQL HASH join hint (see MSDN, Join Hints)
Your SQL :
SELECT TOP 1 ID, Name FROM Product WHERE contains(Name, '"White Dress"')
ORDER BY DateMadeNew desc
Should be rewritten as :
SELECT TOP 1 p.ID, p.Name FROM Product p INNER HASH JOIN Product fts ON fts.ID = p.ID
WHERE contains(fts.Name, '"White Dress"')
ORDER BY p.DateMadeNew desc
If you are using NHibernate, with or without Castle Active Records, I have described in another post how to write an interceptor that modifies your query to replace INNER JOIN with INNER HASH JOIN.
I have a SQL table with more than 1000000 rows, and I need to select from it with the query below:
SELECT DISTINCT TOP (200) COUNT(1) AS COUNT, KEYWORD
FROM QUERIES WITH(NOLOCK)
WHERE KEYWORD LIKE '%Something%'
GROUP BY KEYWORD ORDER BY 'COUNT' DESC
Could you please tell me how can I optimize it to speed up the execution process? Thank you for useful answers.
I'd first look at the execution plan to see how sql server is trying to access your data. Here is a link to just one of many articles on execution plan analysis.
Asking a question about SQL Server performance without providing a schema is a complete waste of everybody's time. I'm going to answer a different question, which is the one you should have asked in the first place:
What schema should I use to efficiently satisfy a query like SELECT DISTINCT TOP (200) COUNT(1) AS COUNT, KEYWORD FROM QUERIES WHERE KEYWORD LIKE '%Something%' GROUP BY KEYWORD ORDER BY 'COUNT' DESC when the QUERIES table has over 1M rows?
The proper schema depends on the selectivity of KEYWORD. One possible design would be to normalize KEYWORD into a lookup table and have a narrow non-clustered index on the lookup id:
CREATE TABLE KEYWORDS (KeywordId INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
Keyword VARCHAR(...) UNIQUE);
CREATE TABLE QUERIES (...,
KeywordId INT NOT NULL,
CONSTRAINT FK_KEYWORD
FOREIGN KEY (KeywordId)
REFERENCES KEYWORDS (KeywordId),
...);
CREATE INDEX ndxQueriesKeyword ON Queries (KeywordId);
If the number of distinct keywords is relatively low, the original query can be satisfied quickly by a scan of the Keywords table followed by a nested loop range scan of the ndxQueriesKeyword index, which is very narrow and therefore generates low IO.
As the number of distinct keywords increases, this approach may start showing problems due to the high number of range scans on the Queries table, and possibly even due to the full scan on the Keywords table.
You may consider using a different WHERE clause, namely one with LIKE 'Something%', which is SARGable and can leverage an index on KEYWORD, benefiting from a range reduction and a narrower scan than a full table scan.
If you are on Enterprise Edition you can consider adding an indexed view with the pre-computed aggregates:
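The effect of the leading wildcard can be seen by comparing query plans. The sketch below uses SQLite as a stand-in for SQL Server, so the plan wording differs from what SSMS would show. The NOCASE collation is an SQLite-specific assumption that allows its case-insensitive LIKE to use the index. The index name is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE collation lets SQLite's default case-insensitive LIKE use the index.
conn.execute("CREATE TABLE QUERIES (KEYWORD TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_queries_keyword ON QUERIES (KEYWORD)")

def plan(sql):
    """Return SQLite's query plan for the statement as a single string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Prefix match: can be turned into a range lookup on the index.
prefix_plan = plan("SELECT KEYWORD FROM QUERIES WHERE KEYWORD LIKE 'Something%'")
# Leading wildcard: every entry has to be examined.
wildcard_plan = plan("SELECT KEYWORD FROM QUERIES WHERE KEYWORD LIKE '%Something%'")

print(prefix_plan)
print(wildcard_plan)
```

In a typical SQLite build the first plan is reported as a SEARCH (an index range lookup) and the second as a SCAN.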
CREATE VIEW vwQueryKeywords
WITH SCHEMABINDING
AS SELECT KEYWORD, COUNT_BIG(*) as COUNT
FROM dbo.QUERIES
GROUP BY KEYWORD;
CREATE UNIQUE CLUSTERED INDEX cdxQueryKeywords ON vwQueryKeywords(KEYWORD);
On EE the optimizer will consider the indexed view for the original query. On non-EE you will have to change the query to run against the view with the NOEXPAND hint:
SELECT KEYWORD, COUNT
FROM vwQueryKeywords WITH(NOEXPAND)
WHERE KEYWORD LIKE '%Something%';
Another completely different approach is to ditch the LIKE '%Something%' condition altogether in favor of full-text search:
SELECT DISTINCT TOP (200) COUNT(1) AS COUNT, KEYWORD
FROM QUERIES
WHERE CONTAINS(Keyword, 'Something')
GROUP BY KEYWORD
ORDER BY 'COUNT' DESC
Because the FT search is a reverse-index lookup, it may prove optimal over a traditional WHERE. The only issue is that you'll only be able to search for full words, since FT won't let you search partial matches the way LIKE does. Again, the actual mileage will vary based on Keyword data profile (ie. its statistics and distribution).
As Jeremy stated, you need to look at the execution plan and client statistics to see what is faster. However, a couple of suggestions. First, do you really need a leading wildcard on your search? I.e., LIKE '%Something%' will not be able to use an index whereas LIKE 'Something%' will. Second, you might try a CTE to see if it will be faster. So, something like:
;With NumberedItems As
(
Select Keyword, Count(*) As [Count]
, ROW_NUMBER() OVER ( ORDER BY Count(*) DESC ) As ItemRank
From Queries WITH (NOLOCK)
Where Keyword LIKE '%Something%'
Group By Keyword
)
Select Keyword, [Count]
From NumberedItems
Where ItemRank <= 200
It's rather hard to guess what may be causing the performance issues with just a query and no schema or execution plan. You should definitely read up on execution plans, as all performance tuning of SQL queries is ultimately driven by them.
If you really want to delve into it, you can also read up on the query optimizer, which attempts to execute your query using the most efficient plan. Understanding the optimizer is important to ensure you are taking full advantage of the indexes, etc. you have on the database. Microsoft also has several helpful documents, such as this one on troubleshooting performance issues.
For your particular case, the bottleneck is most likely in the WHERE clause. LIKE comparisons tend to be inefficient, especially when surrounded by percent signs as the query tends to be unable to take advantage of indexes on the column, etc. Depending on how you've stored data, full-text indexing may be a useful option, as that can frequently outperform LIKE '%SOMEVALUE%'.
If you can't use a full-text search engine from a third party, create an inverted index from your text periodically and search that instead. A naive implementation would beat your current strategy.
http://en.wikipedia.org/wiki/Inverted_index
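A naive inverted index of the kind suggested above might look like the following Python sketch. It is an in-memory illustration with made-up rows. A real version would be rebuilt periodically from the table and persisted somewhere, and would need a proper tokenizer rather than a whitespace split.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each lowercased word to the set of row IDs whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows:
        for word in text.lower().split():
            index[word].add(row_id)
    return index

def search(index, query):
    """Return the IDs of rows that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's postings and intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up rows: (ID, text).
rows = [(1, "white dress"), (2, "white shirt"), (3, "black dress")]
index = build_inverted_index(rows)

print(search(index, "white dress"))
print(search(index, "dress"))
```

Each lookup is a dictionary access followed by set intersections, so the cost depends on the number of matching rows rather than on the size of the table.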
Your query is not optimizable (without implementing some form of full-text indexing, itself expensive) because you have a leading wildcard in your keyword match. You would need to split the keywords out into separate column values (probably in a separate, related table) and search on an exact match or, at least, a match with the wildcard not at the beginning of the text.
Additionally the results you're getting may not be accurate if you have some keywords that are nested in others (eg "cart" will match a keyword search on "car", which is not what you want).
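The nesting problem is easy to reproduce. The sketch below uses SQLite with made-up keywords, as a stand-in for the original table, and compares the substring match with an exact match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE QUERIES (KEYWORD TEXT)")
conn.executemany(
    "INSERT INTO QUERIES VALUES (?)",
    [("car",), ("cart",), ("scar",), ("bike",)],
)

# Substring match: also picks up keywords that merely contain 'car'.
substring = [
    r[0]
    for r in conn.execute(
        "SELECT KEYWORD FROM QUERIES WHERE KEYWORD LIKE '%car%' ORDER BY KEYWORD"
    )
]
# Exact match: only the keyword itself.
exact = [r[0] for r in conn.execute("SELECT KEYWORD FROM QUERIES WHERE KEYWORD = 'car'")]

print(substring)
print(exact)
```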