Getting top n latest entries from SQL Server full text index - sql

I have a table in a SQL Server 2008 R2 database
Article (Id, art_text)
Id is the primary key.
art_text has a full text index.
I search for latest articles that contain the word 'house' like this:
SELECT TOP 100 Id, art_text
FROM Article
WHERE CONTAINS(art_text, 'house')
ORDER BY Id DESC
This returns the correct results but it is slow (~5 seconds). The table has 20 million rows
and 350,000 of those contain the word house. I can see in the query plan that an index scan is performed in the clustered index for the 350,000 Ids returned by the full text index.
The query could be much faster if there would be a way to get only the latest 100 entries in the full text index that contain the word 'house'. Is there any way to do this in a way that the query is faster?

The short answer is yes, there are ways to make this particular query run faster, but with a corpus of 20 million rows, 5 seconds isn't bad. You'll need to seriously consider whether the below suggestions are optimal for your FT search workload and weigh the costs vs the benefits. If you blindly implement these, you're going to have a bad time.
General Suggestions for Improving Sql Server Full-text Search Performance
Reduce the size of the Full-Text index being searched
The smaller the FT index, the faster the query. There are a couple of ways to reduce the FT index size. The first two may or may not apply and the third would take considerable work to accomplish.
Add domain-specific noise words
Noise words are words that don't add value to full-text search queries, such as "the", "and", "in", etc. If there are terms related to the business that add no value being indexed, you may benefit from excluding them from the FT index. Consider a hypothetical full-text index on the MSDN library. Terms such as "Microsoft", "library", "include", "dll" and "reference" may not add value to search results. (Is there any real value in going to http://msdn.microsoft.com and searching for "microsoft"?) A FT index of legal opinions might exclude words such as "defendant", "prosecution" and "legal", etc.
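On SQL Server 2008 and later, the mechanism for this is a custom stoplist rather than the older noise-word files. A minimal sketch, assuming the Article/art_text objects from the question and made-up names for the stoplist and the excluded words:
CREATE FULLTEXT STOPLIST DomainStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST DomainStoplist ADD 'microsoft' LANGUAGE 1033;
ALTER FULLTEXT STOPLIST DomainStoplist ADD 'library' LANGUAGE 1033;
-- Repoint the existing full-text index at the custom stoplist and let it repopulate.
ALTER FULLTEXT INDEX ON Article SET STOPLIST DomainStoplist;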
Strip out extraneous data using iFilters
Full-text search uses Windows iFilters to extract text from binary documents. This is the same technology that the Windows search functionality uses to search PDF and PowerPoint documents. The one case where this is particularly useful is when you have a description column that can contain HTML markup. By default, Sql Server full-text search will index everything, so you get terms such as "font-family", "Arial" and "href" as searchable terms. Using the HTML iFilter can strip out the markup.
The two requirements for using an iFilter in your FT index are that the indexed column is a VARBINARY and that there is a "type" column that contains the file extension. Both of these can be accomplished with computed columns.
CREATE TABLE t (
    ....
    description nvarchar(max),
    FTS_description as ( CAST(description as VARBINARY(MAX)) ),
    FTS_filetype as ( N'.html' )
)
-- Then create the full-text index on FTS_description, specifying the file type.
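A hedged sketch of that statement, assuming a unique key index named PK_t already exists on the table and a default full-text catalog is configured:
CREATE FULLTEXT INDEX ON t ( FTS_description TYPE COLUMN FTS_filetype )
KEY INDEX PK_t;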
Index portions of the table and stitch together results
There are several ways to accomplish this, but the overall idea is to split the table into smaller chunks, query the chunks individually and combine the results. For example, you could create two indexed views, one for the current year and one for historical years, with full-text indexes on them. Your query to return 100 rows changes to look like this:
DECLARE @rows int
DECLARE @ids table (id int not null primary key)

INSERT INTO @ids (id)
SELECT TOP (100) id
FROM vw_2013_FTDocuments
WHERE CONTAINS (....)
ORDER BY Id DESC

SET @rows = @@ROWCOUNT

IF @rows < 100
BEGIN
    DECLARE @rowsLeft int
    SET @rowsLeft = 100 - @rows

    INSERT INTO @ids (id) SELECT TOP (@rowsLeft) ......
    --Logic to incorporate the historic data
END

SELECT ... FROM t INNER JOIN @ids .....
This can result in a substantial reduction in query times at the cost of adding complexity to the search logic. This approach is also applicable when searches are typically limited to a subset of the data. For example, craigslist might have a FT index for Housing, one for "For Sale" and one for "Employment". Any searches done from the home page would be stitched together from the individual indexes while the common case of searches within a category are more efficient.
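As a rough sketch of one of those indexed views (hypothetical: it assumes Article has an art_date column to partition on, and is subject to the usual indexed-view restrictions):
CREATE VIEW dbo.vw_2013_FTDocuments WITH SCHEMABINDING AS
    SELECT Id, art_text
    FROM dbo.Article
    WHERE YEAR(art_date) = 2013
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_2013_FTDocuments ON dbo.vw_2013_FTDocuments (Id)
GO
-- Full-text index on the view, keyed on its unique clustered index.
CREATE FULLTEXT INDEX ON dbo.vw_2013_FTDocuments (art_text) KEY INDEX IX_vw_2013_FTDocuments
GO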
Unsupported technique that will probably break in a future version of Sql Server.
You'll need to test extensively with data of the same quantity and quality as production. If the behavior changes in future versions of Sql Server, you will have no right to complain. This is based on observations, not proof. Use at your own RISK!!
A bit of full-text history
In Sql Server 2005, the full-text search functionality ran in a process external to sqlservr.exe. The way FTS was incorporated into query plans was as a black box: Sql Server would pass FTS a query, and FTS would return a stream of ids. This limited the plans available to Sql Server to ones where the FTS operator could basically be treated as a table scan.
In Sql Server 2008, FTS was integrated into the engine, which improved performance. It also gave the optimizer new options for FTS query plans. Specifically, it now has the option to probe into the FTS index inside a LOOP JOIN operator to check if individual rows match the FTS predicate. (See http://sqlblog.com/blogs/joe_chang/archive/2012/02/19/query-optimizer-gone-wild-full-text.aspx for an excellent discussion of this and ways things can go wrong.)
Requirements for our optimal FTS query plan
There are two characteristics to strive for to get the optimal query plan.
No Sort Operations. Sorting is slow, and we don't want to sort either 20 million rows or 350,000 rows.
Don't return all 350k rows matching the FTS predicate. We need to avoid this if at all possible.
These two criteria eliminate any plan with a hash join, as a hash join requires consuming all of one input to build the hash table.
For plans with a loop join, there are two options. The first is to scan the clustered index backwards and, for each row, probe into the full-text search engine to see if that particular row matches. In theory this seems like a good solution, since once we match 100 rows, we're done. We may have to try 10,000 ids to find the 100 that match, but that may be better than reading all 350k. It could also be worse (see the link to Joe Chang's blog above): if each probe is expensive, our 10k probes could take substantially longer than just reading all 350k rows.
The other loop join option is to have the FTS portion on the outer side of the loop, and seek into the clustered index. Unfortunately, the FTS engine doesn't like to return results in reverse order, so we'd have to read all 350k, and then sort them to return the top 100.
The roadblock is getting the FTS engine to return rows in reverse order. If we can overcome this, then we can reduce the IOs to reading only the last 100 rows that match. Fortunately, the FTS engine has a tendency to return rows in order by the key of the unique index specified when the index was created. (This is a natural side effect of the internal storage the FTS engine uses.)
By adding a computed column that is the negative of the id, and specifying a unique index on that column when creating the FT index, then we're really close.
CREATE TABLE t (id int not null primary key, txt varchar(max), neg_id as (-id) persisted )
CREATE UNIQUE INDEX IX_t_neg_id on t (neg_id)
CREATE FULLTEXT INDEX on t ( txt ) KEY INDEX IX_t_neg_id
Now for our query, we'll use CONTAINSTABLE, and some LEFT-join trickery to ensure that the FTS predicate doesn't end up on the inside of a LOOP JOIN.
SELECT TOP (100) t.id, t.txt
FROM CONTAINSTABLE(t, txt, 'house') ft
LEFT JOIN t ON ft.[Key] = t.neg_id
ORDER BY ft.[Key]
The resulting plan should be a loop join that reads only the last 100 rows from the FT index.
Small gusts of wind that could blow down this house of cards:
Complex FTS queries (as in multiple terms or the use of NOT or OR operators) can cause Sql 2008+ to get "smart" and translate the logic into multiple FTS queries that are joined in the query plan.
Any Cumulative Update, Service Pack or Major version upgrade could render this approach useless.
It may work in 95% of the cases and timeout in the remaining 5%.
It may not work at all for you.
Good Luck!


SQL Indexing: None, Single Column, and Multiple Columns

How does indexing work in SQL and what benefits does it provide? What reason would there be for not indexing? And what is the difference between indexing a single column vs. indexing multiple columns?
How does indexing work in SQL and what benefits does it provide?
When you index columns you express your intent to query the indexed columns in conditional expressions, such as equality or range queries. With this information the storage engine can build a structure that makes such queries faster, often arranging them in tree structures. B-trees are the most common ones, but a lot of different structures exist, such as hash indices, R-tree indices for spatial data, etc. Each structure is specialized in a certain type of lookup. For instance, hash indices are very fast for equality conditions, such as:
SELECT * FROM example_table WHERE type = "example";
SELECT * FROM example_table WHERE id = X;
B-trees are also fairly quick for equality look ups, but their main strength is that they support range queries:
SELECT * FROM example_table WHERE id > 5 AND id < 10
SELECT * FROM example_table WHERE type = "example" and value > 25
It is VERY important, however, when you build B-tree indices to understand that the tree is ordered in a "left-to-right" manner. I.e., if you build a B-tree index (let's call it A) on {type, value}, then you NEED to have a condition on the type column in order for the query to be able to utilize the index. The example index can NOT be used in a query where the condition solely depends on value.
Furthermore, if you mix equality and a range condition, make sure that the equality columns are listed first in the index, otherwise the index can only be partially used.
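A minimal sketch of this left-to-right rule, using a hypothetical composite index on {type, value}:
CREATE INDEX idx_type_value ON example_table (type, value);
-- Can use the index: equality on the leading column, then a range on the second.
SELECT * FROM example_table WHERE type = 'example' AND value > 25;
-- Cannot use this index: there is no condition on the leading column (type).
SELECT * FROM example_table WHERE value > 25;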
What reason would there be for not indexing?
If the selectivity of the index is low, then you might not gain much over a table scan. Say, for instance, that you have an index on a field called gender. Then the selectivity of that index will be low, since a lookup on that index will return half the rows of the original table. You can read a pretty simple explanation of selectivity, and the reasoning behind it, here: http://mattfleming.com/node/192
Also, maintaining an index has a cost. For each data manipulation the index might need restructuring. So keeping the amount of indices to the minimum required to perform well on the queries against that table might be desirable.
What is the difference between indexing a single column vs. indexing multiple columns?
Once again, it depends on the type of queries you issue. Indexing a single column gender might not be a good idea, since the selectivity is low. When the selectivity is high, such an index makes much more sense. For instance, an index on the primary key is a very good index, since the selectivity is as high as it gets (each key in the index corresponds to exactly one record), and indices on columns with unique or highly distinct values (such as slugs, password hashes and what not) are also good single-column indices.
There is also the concept of covering indices. Basically, each leaf in an index contains a pointer into the table where the row is stored (unless the index is a clustered index, in which case the leaf is the record). So for each index hit, the query engine has to fetch the corresponding table row, increasing the number of I/O operations. Since I/O is extremely slow, you want to keep this to a minimum. Now, let's say that you often need to query for something and also fetch some additional columns; then you can create a covering index, trading storage space for query performance. Example: let's find the name and email of all users who have joined in the last 6 months (assuming MySQL):
With index on {joined_at}:
SELECT first_name, last_name, email
FROM users
WHERE joined_at > NOW() - INTERVAL 6 MONTH;
Query explanation:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE users ALL test NULL NULL NULL 873 Using where
As you can see in the type-column, the query engine resorted to a full table scan, since the index selectivity was too low to be worthwhile using in this query (too many results would be returned, and thus followed into the table, costing too much in I/O)
With index on {joined_at, first_name, last_name, email}:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE users range test,test2 test2 8 NULL 514 Using where; Using index
Now, since all the information that is necessary to complete the query is available in the index, the query engine evaluates that it is much better to use the index (with 514 rows) instead of doing a full table scan.
So as you can see, by using covering indices we can speed up queries for partial selects of a table, even if the selectivity of the index is quite small.
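For reference, the index definitions behind the two cases above might look like this (index names are made up, MySQL syntax):
-- Case 1: index on joined_at only (not covering for this query)
CREATE INDEX idx_joined_at ON users (joined_at);
-- Case 2: covering index that also carries the selected columns
CREATE INDEX idx_joined_covering ON users (joined_at, first_name, last_name, email);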
How does indexing work in SQL
That's a pretty open question, but basically databases store a structure that enables faster lookup of information. That structure is dependent on the implementation, but it's typically a type of tree.
what benefits does it provide?
Queries that are SARGable can be significantly faster.*
What reason would there be for not indexing?
Some data modification queries can take longer and there is a storage cost to indexes, but generally speaking, both these considerations are negligible.
And what is the difference between indexing a single column vs. indexing multiple columns?
There isn't much difference, but sometimes people create covering indexes** that index multiple columns to increase the performance of a specific query.
*SARGable is from Search ARGument ABLE. Basically if you do WHERE FOO > 5 it can be faster if FOO is indexed. On the other hand WHERE h(FOO) > 5 probably won't benefit from an index.
** If all the fields used in the SELECT JOIN and WHERE of a statement are also in an index a database can retrieve all the information it needs without going back to the base table. This is called a covering index. If all the fields were in separate indexes it would only use the ones for the joins and where and then go back to the base table for the columns in the select.
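A quick illustration of the SARGable point, assuming a hypothetical orders table with an index on order_date:
-- Sargable: the indexed column stands alone, so the engine can seek on it.
SELECT * FROM orders WHERE order_date >= '2013-01-01';
-- Not sargable: wrapping the column in a function typically forces a scan.
SELECT * FROM orders WHERE YEAR(order_date) = 2013;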

How do I optimize this query?

I have a very specific query. I tried lots of ways but I couldn't reach the performance I want.
SELECT *
FROM
items
WHERE
user_id=1
AND
(item_start < 20000 AND item_end > 30000)
I created an index on user_id, item_start, item_end.
This didn't work, so I dropped all indexes and created new ones:
user_id, (item_start, item_end)
This didn't work either.
(user_id, item_start and item_end are int.)
edit: database is MySQL 5.1.44, engine is InnoDB
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
create (or change) your clustered index to be on item_user_id, item_start, item_end. This will ensure that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but may slow down others, so you'll need to be careful.
if it's not practical to change your clustered index, you can create a non-clustered index on item_user_id, item_start, item_end and any other columns your query needs. This will slow down inserts somewhat, and will double the storage required for your table, but will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every row. If there are large numbers of matching rows for a particular user ID, and if your clustered index's first column is not item_user_id, then this will likely be a very inefficient operation because your disk will be trying to fetch lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by item_user_id, item_start, item_end then that should speed things up. Note that this is not a panacea, since if you have other queries which depend on a different ordering, or if you're inserting rows in a different order, you could end up slowing down other queries.
A less impactful solution would be to create a covering index which contains only the columns you want (also ordered by item_user_id, item_start, item_end, and then add the other cols you need). Then change your query to only pull back the cols you need, instead of using SELECT *.
If you could post more info about the DBMS brand and version and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on user_id, item_start, item_end with the fields you need in the SELECT part as included columns. This all assumes you're using Microsoft SQL Server 2005+.
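As a hedged sketch (SQL Server syntax; col1 and col2 stand in for whatever columns the SELECT actually needs):
CREATE NONCLUSTERED INDEX IX_items_user_start_end
ON items (user_id, item_start, item_end)
INCLUDE (col1, col2);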

effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering if limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only very limited effect on the speed of the query but would have a potentially larger effect on the transfer speed of the data. The less data you select, the less data that would need to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
)
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing. More columns than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections to somehow save a little I/O.
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the engine.
To demonstrate what tvanfosson has already written, namely that there is a "transfer" cost, I ran the following two statements on an MSSQL 2000 DB from Query Analyzer.
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also because the fields are the same I would not expect indexing to factor here.

SQL Server Index - Any improvement for LIKE queries?

We have a query that runs off a fairly large table that unfortunately needs to use LIKE '%ABC%' on a couple varchar fields so the user can search on partial names, etc. SQL Server 2005
Would adding an index on these varchar fields help any in terms of select query performance when using LIKE or does it basically ignore the indexes and do a full scan in those cases?
Any other possible ways to improve performance when using LIKE?
Only if you add full-text searching to those columns, and use the full-text query capabilities of SQL Server.
Otherwise, no, an index will not help.
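For completeness, a rough sketch of what adding full-text search to such a column could look like (table, key index and catalog names are assumptions; note that CONTAINS supports prefix terms, not a true '%ABC%' infix match):
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Products (ProductName) KEY INDEX PK_Products;
SELECT * FROM dbo.Products WHERE CONTAINS(ProductName, '"ABC*"');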
You can potentially see performance improvements by adding index(es), it depends a lot on the specifics :)
How much of the total size of the row are your predicated columns? How many rows do you expect to match? Do you need to return all rows that match the predicate, or just top 1 or top n rows?
If you are searching for values with high selectivity/uniqueness (so few rows to return), and the predicated columns are a smallish portion of the entire row size, an index could be quite useful. It will still be a scan, but your index will fit more rows per page than the source table.
Here is an example where the total row size is much greater than the column size to search across:
create table t1 (v1 varchar(100), b1 varbinary(8000))
go
--add 10k rows of filler
insert t1 values ('abc123def', cast(replicate('a', 8000) as varbinary(8000)))
go 10000
--add 1 row to find
insert t1 values ('abc456def', cast(replicate('a', 8000) as varbinary(8000)))
go
set statistics io on
go
select * from t1 where v1 like '%456%'
--shows 10001 logical reads
--create index that only contains the column(s) to search across
create index t1i1 on t1(v1)
go
select * from t1 where v1 like '%456%'
--shows 37 logical reads
If you look at the actual execution plan you can see the engine scanned the index and did a bookmark lookup on the matching row. Or you can tell the optimizer directly to use the index, if it hadn't decide to use this plan on its own:
select * from t1 with (index(t1i1)) where v1 like '%456%'
If you have a bunch of columns to search across only a few that are highly selective, you could create multiple indexes and use a reduction approach. E.g. first determine a set of IDs (or whatever your PK is) from your highly selective index, then search your less selective columns with a filter against that small set of PKs.
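A hedged sketch of that reduction approach, with hypothetical table and column names:
DECLARE @candidates table (id int NOT NULL PRIMARY KEY);
-- Step 1: narrow down using the highly selective, index-friendly predicate.
INSERT INTO @candidates (id)
SELECT id FROM products WHERE sku LIKE 'ABC%';
-- Step 2: apply the expensive wildcard LIKE only to the small candidate set.
SELECT p.*
FROM products p
JOIN @candidates c ON c.id = p.id
WHERE p.description LIKE '%ABC%';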
If you always need to return a large set of rows you would almost certainly be better off with a table scan.
So the possible optimizations depend a lot on the specifics of your table definition and the selectivity of your data.
HTH!
-Adrian
The only other way (other than using full-text indexing) you could improve performance is to use LIKE 'ABC%' - don't add the wildcard on both ends of your search term - in that case, an index could work.
If your requirements are such that you have to have wildcards on both ends of your search term, you're out of luck...
Marc
Like '%ABC%' will always perform a full table scan. There is no way around that.
You do have a couple of alternative approaches. Firstly, full-text searching; it's really designed for this sort of problem, so I'd look at that first.
Alternatively in some circumstances it might be appropriate to denormalize the data and pre-process the target fields into appropriate tokens, then add these possible search terms into a separate one to many search table. For example, if my data always consisted of a field containing the pattern 'AAA/BBB/CCC' and my users were searching on BBB then I'd tokenize that out at insert/update (and remove on delete). This would also be one of those cases where using triggers, rather than application code, would be much preferred.
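A rough sketch of that token-table idea (all names hypothetical; the token rows would be maintained by insert/update/delete triggers):
CREATE TABLE item_search_token (
    item_id int NOT NULL,
    token varchar(100) NOT NULL,
    CONSTRAINT PK_item_search_token PRIMARY KEY (token, item_id)
);
-- 'AAA/BBB/CCC' is split into rows ('AAA'), ('BBB'), ('CCC') at write time,
-- so a search on BBB becomes an equality seek instead of LIKE '%BBB%':
SELECT i.*
FROM items i
JOIN item_search_token t ON t.item_id = i.id
WHERE t.token = 'BBB';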
I must emphasize that this is not really an optimal technique and should only be used if the data is a good match for the approach and for some reason you do not want to use full-text search (and the database performance on the LIKE scan really is unacceptable). It's also likely to produce maintenance headaches further down the line.
Create statistics on that column. SQL Server 2005 has optimized in-string searches, so you might benefit from that.

When should you use full-text indexing?

We have a whole bunch of queries that "search" for clients, customers, etc. You can search by first name, email, etc. We're using LIKE statements in the following manner:
SELECT *
FROM customer
WHERE fname LIKE '%someName%'
Does full-text indexing help in the scenario? We're using SQL Server 2005.
It will depend upon your DBMS. I believe that most systems will not take advantage of the full-text index unless you use the full-text functions. (e.g. MATCH/AGAINST in mySQL or FREETEXT/CONTAINS in MS SQL)
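For example (assuming the customer table from the question already has a full-text index on fname):
-- MySQL:
SELECT * FROM customer WHERE MATCH(fname) AGAINST ('someName');
-- SQL Server:
SELECT * FROM customer WHERE CONTAINS(fname, 'someName');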
Here are two good articles on when, why, and how to use full-text indexing in SQL Server:
How To Use SQL Server Full-Text Searching
Solving Complex SQL Problems with Full-Text Indexing
FTS can help in this scenario, the question is whether it is worth it or not.
To begin with, let's look at why LIKE may not be the most effective search. When you use LIKE, especially when you are searching with a % at the beginning of your comparison, SQL Server needs to perform both a table scan of every single row and a byte by byte check of the column you are checking.
FTS has some better algorithms for matching data, as well as some better statistics on variations of names. Therefore FTS can provide better performance for matching Smith, Smythe, Smithers, etc. when you look for Smith.
It is, however, a bit more complex to use FTS, as you'll need to master CONTAINS vs FREETEXT and the arcane format of the search. However, if you want to do a search where either FName or LName match, you can do that with one statement instead of an OR.
To determine if FTS is going to be effective, determine how much data you have. I use FTS on a database of several hundred million rows and that's a real benefit over searching with LIKE, but I don't use it on every table.
If your table size is more reasonable, less than a few million, you can get similar speed by creating an index for each column that you're going to be searching on and SQL Server should perform an index scan rather than a table scan.
According to my test scenario:
SQL Server 2008
10,000,000 rows, each with a string like "wordA wordB wordC..." (varies between 1 and 30 words)
selecting count(*) with CONTAINS(column, "wordB")
result size: several hundred thousand rows
catalog size: approx. 1.8 GB
Full-text index was in range of 2s whereas like '% wordB %' was in range of 1-2 minutes.
But this counts only if you don't use any additional selection criteria! E.g. if I additionally used some "like 'prefix%'" on a primary key column, the performance was worse, since going into the full-text index costs more than doing a string search in some fields (as long as those are not too large).
So I would recommend full-text index only in cases where you have to do a "free string search" or use some of the special features of it...
To answer the question specifically for MSSQL, full-text indexing will NOT help in your scenario.
In order to improve that query you could do one of the following:
Configure a full-text catalog on the column and use the CONTAINS() function.
If you were primarily searching with a prefix (i.e. matching from the start of the name), you could change the predicate to the following and create an index over the column.
where fname like 'prefix%'
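A small sketch of that second option, assuming an ordinary index on the searched column:
CREATE INDEX IX_customer_fname ON customer (fname);
SELECT * FROM customer WHERE fname LIKE 'prefix%';   -- can seek on the index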
(1) is probably overkill for this, unless the performance of the query is a big problem.