SQL Server Index - Any improvement for LIKE queries? - sql

We have a query that runs off a fairly large table that unfortunately needs to use LIKE '%ABC%' on a couple of varchar fields so the user can search on partial names, etc. This is SQL Server 2005.
Would adding an index on these varchar fields help any in terms of select query performance when using LIKE or does it basically ignore the indexes and do a full scan in those cases?
Any other possible ways to improve performance when using LIKE?

Only if you add full-text searching to those columns, and use the full-text query capabilities of SQL Server.
Otherwise, no, an index will not help.
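For example, a rough sketch of the full-text setup (dbo.Customers, Name and PK_Customers are placeholder names; also note that CONTAINS matches whole words or word prefixes, not arbitrary mid-word substrings):
CREATE FULLTEXT CATALOG ftCatalog AS DEFAULT
go
CREATE FULLTEXT INDEX ON dbo.Customers (Name) KEY INDEX PK_Customers ON ftCatalog
go
--then search with CONTAINS instead of LIKE '%ABC%'
SELECT * FROM dbo.Customers WHERE CONTAINS(Name, '"ABC*"')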

You can potentially see performance improvements by adding index(es), it depends a lot on the specifics :)
How much of the total size of the row are your predicated columns? How many rows do you expect to match? Do you need to return all rows that match the predicate, or just top 1 or top n rows?
If you are searching for values with high selectivity/uniqueness (so few rows to return), and the predicated columns are a smallish portion of the entire row size, an index could be quite useful. It will still be a scan, but your index will fit more rows per page than the source table.
Here is an example where the total row size is much greater than the column size to search across:
create table t1 (v1 varchar(100), b1 varbinary(8000))
go
--add 10k rows of filler
insert t1 values ('abc123def', cast(replicate('a', 8000) as varbinary(8000)))
go 10000
--add 1 row to find
insert t1 values ('abc456def', cast(replicate('a', 8000) as varbinary(8000)))
go
set statistics io on
go
select * from t1 where v1 like '%456%'
--shows 10001 logical reads
--create index that only contains the column(s) to search across
create index t1i1 on t1(v1)
go
select * from t1 where v1 like '%456%'
--shows 37 logical reads
--(or you can force the index with a hint; see below)
If you look at the actual execution plan you can see the engine scanned the index and did a bookmark lookup on the matching row. Or you can tell the optimizer directly to use the index, if it hadn't decided to use this plan on its own:
select * from t1 with (index(t1i1)) where v1 like '%456%'
If you have a bunch of columns to search across but only a few that are highly selective, you could create multiple indexes and use a reduction approach. E.g. first determine a set of IDs (or whatever your PK is) from your highly selective index, then search your less selective columns with a filter against that small set of PKs (sketched below).
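A rough sketch of that reduction approach, with entirely hypothetical names (dbo.BigTable with PK id, a narrow nonclustered index assumed on SelectiveCol, and a wide unselective column WideCol):
DECLARE @candidates TABLE (id int PRIMARY KEY)

INSERT INTO @candidates (id)
SELECT id
FROM dbo.BigTable
WHERE SelectiveCol LIKE '%rareterm%'   --scans the narrow index on SelectiveCol

SELECT bt.*
FROM dbo.BigTable AS bt
INNER JOIN @candidates AS c ON c.id = bt.id
WHERE bt.WideCol LIKE '%commonterm%'   --evaluated only for the candidate rows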
If you always need to return a large set of rows you would almost certainly be better off with a table scan.
So the possible optimizations depend a lot on the specifics of your table definition and the selectivity of your data.
HTH!
-Adrian

The only other way (other than using fulltext indexing) you could improve performance is to use LIKE 'ABC%' - don't add the wildcard on both ends of your search term - in that case, an index could work.
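For example (hypothetical table and column names), the difference looks like this:
--index on LastName can be used for a range seek, since the prefix is fixed
select * from dbo.People where LastName like 'ABC%'
--leading wildcard defeats the seek; at best the engine scans an index or the table
select * from dbo.People where LastName like '%ABC%'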
If your requirements are such that you have to have wildcards on both ends of your search term, you're out of luck...
Marc

Like '%ABC%' will always perform a full scan (of the table, or at best of an index). There is no way around that.
You do have a couple of alternative approaches. Firstly, full text searching: it's really designed for this sort of problem, so I'd look at that first.
Alternatively, in some circumstances it might be appropriate to denormalize the data and pre-process the target fields into appropriate tokens, then add these possible search terms into a separate one-to-many search table. For example, if my data always consisted of a field containing the pattern 'AAA/BBB/CCC' and my users were searching on BBB, then I'd tokenize that out at insert/update (and remove on delete). This would also be one of those cases where using triggers, rather than application code, would be much preferred.
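A rough sketch of that token table and trigger, with entirely made-up names (dbo.Items with an ItemId key and a Code column holding the 'AAA/BBB/CCC' values). Note that STRING_SPLIT needs SQL Server 2016+, so on 2005 you'd substitute your own split function:
create table dbo.ItemTokens (
    Token  varchar(100) not null,
    ItemId int          not null,
    constraint PK_ItemTokens primary key (Token, ItemId)
)
go
create trigger trg_Items_Tokenize on dbo.Items
after insert
as
begin
    insert dbo.ItemTokens (Token, ItemId)
    select distinct s.value, i.ItemId
    from inserted as i
    cross apply string_split(i.Code, '/') as s   --'AAA/BBB/CCC' -> 'AAA','BBB','CCC'
end
go
--a search on the middle token is then a plain index seek
select ItemId from dbo.ItemTokens where Token = 'BBB'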
I must emphasize that this is not really an optimal technique and should only be used if the data is a good match for the approach and for some reason you do not want to use full text search (and the database performance on the LIKE scan really is unacceptable). It's also likely to produce maintenance headaches further down the line.

Create statistics on that column. SQL Server 2005 has optimized in-string searching, so you might benefit from that.
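If you want to try that, the syntax is along these lines (table and column names are placeholders):
create statistics st_YourColumn on dbo.YourTable (YourColumn) with fullscan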

Related

Getting top n latest entries from SQL Server full text index

I have a table in a SQL Server 2008 R2 database
Article (Id, art_text)
Id is the primary key.
art_text has a full text index.
I search for latest articles that contain the word 'house' like this:
SELECT TOP 100 Id, art_text
FROM Article
WHERE CONTAINS(art_text, 'house')
ORDER BY Id DESC
This returns the correct results but it is slow (~5 seconds). The table has 20 million rows
and 350,000 of those contain the word house. I can see in the query plan that an index scan is performed in the clustered index for the 350,000 Ids returned by the full text index.
The query could be much faster if there were a way to get only the latest 100 entries in the full text index that contain the word 'house'. Is there any way to do this so that the query is faster?
The short answer is yes, there are ways to make this particular query run faster, but with a corpus of 20 million rows, 5 seconds isn't bad. You'll need to seriously consider whether the suggestions below are optimal for your FT search workload and weigh the costs vs the benefits. If you blindly implement these, you're going to have a bad time.
General Suggestions for Improving Sql Server Full-text Search Performance
Reduce the size of the full-text index being searched. The smaller the FT index, the faster the query. There are a couple of ways to reduce the FT index size. The first two may or may not apply and the third would take considerable work to accomplish.
Add domain-specific noise words. Noise words are words that don't add value to full-text search queries, such as "the", "and", "in", etc. If there are business-related terms being indexed that add no value, you may benefit from excluding them from the FT index. Consider a hypothetical full-text index on the MSDN library. Terms such as "Microsoft", "library", "include", "dll" and "reference" may not add value to search results. (Is there any real value in going to http://msdn.microsoft.com and searching for "microsoft"?) A FT index of legal opinions might exclude words such as "defendant", "prosecution" and "legal", etc.
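In SQL Server 2008+ this is done with a stoplist. A sketch of what that might look like for the Article table (the stoplist name and the words added are just examples, and altering the index's stoplist typically triggers a repopulation):
CREATE FULLTEXT STOPLIST DomainStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST DomainStoplist ADD 'microsoft' LANGUAGE 'English';
ALTER FULLTEXT STOPLIST DomainStoplist ADD 'library' LANGUAGE 'English';
ALTER FULLTEXT INDEX ON dbo.Article SET STOPLIST DomainStoplist;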
Strip out extraneous data using iFilters. Full-text search uses Windows iFilters to extract text from binary documents. This is the same technology that the Windows search functionality uses to search PDF and PowerPoint documents. The one case where this is particularly useful is when you have a description column that can contain HTML markup. By default, Sql Server full-text search will index everything, so you get terms such as "font-family", "Arial" and "href" as searchable terms. Using the HTML iFilter can strip out the markup.
The two requirements for using an iFilter in your FT index are that the indexed column is a VARBINARY and that there is a "type" column containing the file extension. Both can be accomplished with computed columns.
CREATE TABLE t (
....
description nvarchar(max),
FTS_description as (CAST(description as VARBINARY(MAX))),
FTS_filetype as ( N'.html' )
)
-- Then create the fulltext index on FTS_description specifying the filetype.
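-- For example (PK_t and ft_catalog are assumed names for the table's key index and the
-- full-text catalog; adjust to your own schema):
CREATE FULLTEXT INDEX ON t
    (FTS_description TYPE COLUMN FTS_filetype LANGUAGE 1033)
    KEY INDEX PK_t ON ft_catalog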
Index portions of the table and stitch together results. There are several ways to accomplish this, but the overall idea is to split the table into smaller chunks, query the chunks individually and combine the results. For example, you could create two indexed views, one for the current year and one for historical years, with full-text indexes on them. Your query to return 100 rows changes to look like this:
DECLARE @rows int
DECLARE @ids table (id int not null primary key)
INSERT INTO @ids (id)
SELECT TOP (100) id
FROM vw_2013_FTDocuments WHERE CONTAINS (....)
ORDER BY Id DESC
SET @rows = @@rowcount
IF @rows < 100
BEGIN
DECLARE @rowsLeft int
SET @rowsLeft = 100 - @rows
INSERT INTO @ids (id) SELECT TOP (@rowsLeft) ......
--Logic to incorporate the historic data
END
SELECT ... FROM t INNER JOIN @ids .....
This can result in a substantial reduction in query times at the cost of adding complexity to the search logic. This approach is also applicable when searches are typically limited to a subset of the data. For example, craigslist might have a FT index for Housing, one for "For Sale" and one for "Employment". Any searches done from the home page would be stitched together from the individual indexes while the common case of searches within a category are more efficient.
Unsupported technique that will probably break in a future version of Sql Server.
You'll need to test extensively with data of the same quantity and quality as production. If the behavior changes in future versions of Sql server, you will have no right to complain. This is based off of observations, not proof. Use at your own RISK!!
A bit of full-text history. In Sql Server 2005, the full-text search functionality lived in a process external to sqlservr.exe. FTS was incorporated into query plans as a black box: Sql Server would pass FTS a query, and FTS would return a stream of ids. This limited the plans available to Sql Server to ones where the FTS operator could basically be treated as a table scan.
In Sql Server 2008, FTS was integrated into the engine, which improved performance. It also gave the optimizer new options for FTS query plans. Specifically, it now has the option to probe into the FTS index inside a LOOP JOIN operator to check if individual rows match the FTS predicate. (See http://sqlblog.com/blogs/joe_chang/archive/2012/02/19/query-optimizer-gone-wild-full-text.aspx for an excellent discussion of this and ways things can go wrong.)
Requirements for our optimal FTS query plan. There are two characteristics to strive for to get the optimal query plan.
No Sort Operations. Sorting is slow, and we don't want to sort either 20 million rows or 350,000 rows.
Don't return all 350k rows matching the FTS predicate. We need to avoid this if at all possible.
These two criteria eliminate any plan with a hash join, as a hash join requires consuming all of one input to build the hash table.
For plans with a loop join, there are two options. The first is to scan the clustered index backwards and, for each row, probe into the full-text search engine to see if that particular row matches. In theory this seems like a good solution: once we match 100 rows, we're done. We may have to try 10,000 ids to find the 100 that match, but that may be better than reading all 350k. It could also be worse (see the link to Joe Chang's blog above): if each probe is expensive, our 10k probes could take substantially longer than just reading all 350k rows.
The other loop join option is to have the FTS portion on the outer side of the loop, and seek into the clustered index. Unfortunately, the FTS engine doesn't like to return results in reverse order, so we'd have to read all 350k, and then sort them to return the top 100.
The roadblock is getting the FTS engine to return rows in reverse order. If we can overcome this, then we can reduce the IOs to reading only the last 100 rows that match. Fortunately, the FTS engine has a tendency to return rows in order by the key of the unique index specified when the index was created. (This is a natural side-effect of the internal storage the FTS engine uses.)
By adding a computed column that is the negative of the id, and specifying a unique index on that column when creating the FT index, then we're really close.
CREATE TABLE t (id int not null primary key, txt varchar(max), neg_id as (-id) persisted )
CREATE UNIQUE INDEX IX_t_neg_id on t (neg_id)
CREATE FULLTEXT INDEX on t ( txt ) KEY INDEX IX_t_neg_id
Now for our query, we'll use CONTAINSTABLE, and some LEFT-join trickery to ensure that the FTS predicate doesn't end up on the inside of a LOOP JOIN.
SELECT TOP (100) t.id, t.txt
FROM CONTAINSTABLE(t, txt, 'house') ft
LEFT JOIN t on ft.[Key] = t.neg_id ORDER BY ft.[Key]
The resulting plan should be a loop join that reads only the last 100 rows from the FT index.
Small gusts of wind that could blow down this house of cards:
Complex FTS queries (as in multiple terms, or the use of NOT or OR operators) can cause Sql 2008+ to get "smart" and translate the logic into multiple FTS queries that are joined in the query plan.
Any Cumulative Update, Service Pack or Major version upgrade could render this approach useless.
It may work in 95% of the cases and timeout in the remaining 5%.
It may not work at all for you.
Good Luck!

Index on multiple bit fields in SQL Server

We currently have a scenario where one table effectively has several (10 to 15) boolean flags (not nullable bit fields). Unfortunately, it is not really possible to simplify this too much on a logical level, because any combination of the boolean values is permissible.
The table in question is a transactional table which may end up having tens of millions of rows, and both insert and select performance is fairly critical. Although we are not quite sure of the distribution of the data at this time, the combination of all flags should provide relatively good cardinality, i.e. make it a "worthwhile" index for SQL Server to make use of.
Typical select query scenarios might be to select records based on 3 or 4 of the flags only, e.g. WHERE FLAG3=1 AND FLAG7=0 AND FLAG9=1. It would not be practical to create separate indexes for all combinations of the flags used by these select queries, as there will be many of them.
Given this situation, what would be the recommended approach to effectively index these fields? The table is new, so there is no existing data to worry about yet, and we have a fair amount of flexibility in the actual implementation of the table.
There are two main options that we are considering at the moment:
Create a single index which includes all the bit fields (this would probably include 1 or 2 other int fields which would always be used). My concern is that given the typical usage of only including a few of the fields, this approach would skip the index and resort to a table scan. Let's call this Option A. (Having read some of the replies, it seems that this approach would not work well, since the order of the fields in the index would make a difference, making it impossible to index effectively on ALL the fields.)
Effectively do what I believe SQL Server is doing internally, and encode the bit fields into a single int field using binary operators (AND-ing and OR-ing numbers together: 1, 2, 4, 8, etc). My concern here is that we'd need to do some kind of calculation to query on this encoded field, which would skip the index again. Maintenance and the complexity of this solution is also a concern. Let's call this Option B. Additional info: The argument for this approach is that we could have a relatively simple and short index which includes one or two other fields from the table and this field. The other fields would narrow down the number of records needing to be evaluated, and since the encoded field would contain all of our bit fields, SQL Server would be able to perform the calculation using the data retrieved from the index directly (i.e. an index scan) as opposed to the table (i.e. a table scan).
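For reference, a sketch of what Option B could look like (all table names, column names and bit assignments here are made up; the bitwise predicate itself isn't seekable, but a leading equality column lets SQL Server seek and then evaluate the flag test from data already in the index):
CREATE TABLE dbo.Tx
(
    TxId      int IDENTITY(1,1) PRIMARY KEY,
    AccountId int NOT NULL,
    Flags     int NOT NULL DEFAULT (0)   -- flag n stored as bit value 2^(n-1)
)
CREATE INDEX IX_Tx_Account_Flags ON dbo.Tx (AccountId, Flags)

-- FLAG3=1 AND FLAG7=0 AND FLAG9=1 becomes (4 = flag 3, 64 = flag 7, 256 = flag 9):
SELECT TxId
FROM dbo.Tx
WHERE AccountId = 42
  AND Flags & (4 | 256) = (4 | 256)   -- both flags set
  AND Flags & 64 = 0                  -- flag 7 clear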
At the moment, we are heavily leaning towards Option B. For completeness, this would be running on SQL Server 2008.
Any advice would be greatly appreciated.
Edit: Spelling, clarity, query example, additional info on Option B.
A single BIT column typically is not selective enough to be even considered for use in an index. So an index on a single BIT column really doesn't make sense - on average, you'd always have to search about half the entries in the table (50% selectiveness) and so the SQL Server query optimizer will instead use a table scan.
If you create a single index on all 15 bit columns, then you don't have that problem - since you have 15 yes/no options, your index will become quite selective.
Trouble is: the sequence of the bit columns is important. Your index will only ever be considered if your SQL statement uses at least 1-n of the left-most BIT columns.
So if your index is on
Col1,Col2,Col3,....,Col14,Col15
then it might be used for a query that uses
Col1
Col1 and Col2
Col1 and Col2 and Col3
....
and so on. But it cannot be used for a query that specifies Col6, Col9 and Col14.
Because of that, I don't really think an index on your collection of BIT columns really makes a lot of sense.
Are those 15 BIT columns the only columns you use for querying? If not, I would try to combine those BIT columns that you use most for selection with other columns, e.g. have an index on Name and Col7 or something (then your BIT columns can add some additional selectivity to another index)
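For example, a hypothetical composite index along those lines (all names made up):
CREATE INDEX IX_Tx_Customer_Flags ON dbo.Transactions (CustomerId, Flag3, Flag7, Flag9)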
Whilst there are probably ways to solve your indexing problem against your existing table schema, I would reduce this to a normalisation problem:
e.g. I would highly recommend creating a series of new tables:
A lookup table for the names of these bit flags, e.g. CREATE TABLE Flags (id int IDENTITY(1,1), Name varchar(256)) (you don't have to make id an identity-seed column if you want to manually control the id's - e.g. 2,4,8,16,32,64,128 as binary flags.)
A new link table that contains the id's of the original data table and the Flags table, e.g. CREATE TABLE DataFlags_Link (id int IDENTITY(1,1), MyFlagId int, DataId int)
You could then create an index on the DataFlags_Link table and write queries like:
SELECT Data.*
FROM Data
INNER JOIN DataFlags_Link ON Data.id = DataFlags_Link.DataId
WHERE DataFlags_Link.MyFlagId IN (4,7,2,8)
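An index to support that query might look like this (a sketch following the example names above):
CREATE INDEX IX_DataFlags_Link ON DataFlags_Link (MyFlagId, DataId)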
As for performance, that's where good DBA maintenance comes in. You'll want to set the INDEX fill-factor and padding on your tables appropriately and run regular index defragmentation or rebuild your indexes on a schedule.
Performance and maintenance go hand-in-hand with databases. You can't have one without the other.
Whilst Neil Fenwick's answer is probably right, I think the real answer is to try out the different options and see which one is fast enough.
Option 1 is probably the most straightforward solution, and therefore likely the most maintainable - and it may well be fast enough.
I would build a prototype database, with the "option 1" schema, and use something like http://www.red-gate.com/products/sql-development/sql-data-generator/ or http://sourceforge.net/projects/dbmonster/ to create twice as much data as you expect to need, and then build the queries you expect to need. Agree an acceptable response time, and only consider a "faster" schema if you exceed those response times (and you can't throw hardware at the problem).
Neil's solution is probably just as obvious and maintainable as "option 1" - and it should be easy to index. However, I'd still test it by creating a prototype schema and generating a lot of test data...

How to search millions of record in SQL table faster?

I have a SQL table with millions of domain names. But now when I search for, let's say,
SELECT *
FROM tblDomainResults
WHERE domainName LIKE '%lifeis%'
It takes more than 10 minutes to get the results. I tried indexing but that didn't help.
What is the best way to store these millions of records and easily access this information in a short period of time?
There are about 50 million records and 5 column so far.
Most likely, you tried a traditional index which cannot be used to optimize LIKE queries unless the pattern begins with a fixed string (e.g. 'lifeis%').
What you need for your query is a full-text index. Most DBMS support it these days.
Assuming that your 50 million row table includes duplicates (perhaps that is part of the problem), and assuming SQL Server (the syntax may change but the concept is similar on most RDBMSes), another option is to store domains in a lookup table, e.g.
CREATE TABLE dbo.Domains
(
DomainID INT IDENTITY(1,1) PRIMARY KEY,
DomainName VARCHAR(255) NOT NULL
);
CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName);
When you load new data, check if any of the domain names are new - and insert those into the Domains table. Then in your big table, you just include the DomainID. Not only will this keep your 50 million row table much smaller, it will also make lookups like this much more efficient.
SELECT * -- please specify column names
FROM dbo.tblDomainResults AS dr
INNER JOIN dbo.Domains AS d
ON dr.DomainID = d.DomainID
WHERE d.DomainName LIKE '%lifeis%';
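The "insert those into the Domains table" step might look roughly like this, assuming the incoming rows land in a staging table first (dbo.IncomingDomains is a made-up name):
INSERT INTO dbo.Domains (DomainName)
SELECT DISTINCT i.DomainName
FROM dbo.IncomingDomains AS i
WHERE NOT EXISTS
    (SELECT 1 FROM dbo.Domains AS d WHERE d.DomainName = i.DomainName);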
Of course except on the tiniest of tables, it will always help to avoid LIKE clauses with a leading wildcard.
Full-text indexing is the far-and-away best option here - how this is accomplished will depend on the DBMS you're using.
Short of that, ensuring that you have an index on the column being matched with the pattern will help performance, but by the sounds of it, you've tried this and it didn't help a great deal.
Stop using the LIKE statement. You could use full-text search, but it will require a MyISAM table and isn't all that good a solution.
I would recommend for you to examine available 3rd party solutions - like Lucene and Sphinx. They will be superior.
One thing you might want to consider is having a separate search engine for such lookups. For example, you can use a SOLR (Lucene) server to search on and retrieve the ids of entries that match your search, then retrieve the data from the database by id. Even having to make two different calls, it's very likely it will wind up being faster.
Indexes are slowed down whenever they have to go lookup ("bookmark lookup") data that the index itself doesn't contain. For instance, if your index has 2 columns, ID, and NAME, but you're selecting * (which is 5 columns total) the database has to read the index for the first two columns, then go lookup the other 3 columns in a less efficient data structure somewhere else.
In this case, your index can't be used because of the LIKE. This is similar to not putting any WHERE filter on the query: it will skip the index altogether, since it has to read the whole table anyway, so it will do just that (a "table scan"). There is a threshold (I think around 35-50%) where the engine normally flips over to this.
In short, it seems unlikely that you need all 50 million rows from the DB for a production application, but if you do... use a machine with more memory and try methods that keep that data in memory. Maybe a NoSQL DB would be a better option - MongoDB, CouchDB, Tokyo Cabinet. Things like this. Good luck!
You could try breaking up the domain into chunks and then searching the chunks themselves. I did something like that years ago when I needed to search for words in sentences. I did not have full-text searching available, so I broke up the sentences into a word list and searched the words. It was really fast to find the results since the words were indexed.

how to optimize sql server table for faster response?

I found that a table has 50 thousand records and it takes one minute to fetch data from the SQL Server table just by issuing a SQL statement. There is one primary key, which means a clustered index is already there. I just do not understand why it takes one minute. Besides an index, what are the ways out there to optimize a table to get the data faster? In this situation, what do I need to do for a faster response? Also tell me how we can always write optimized SQL. Please tell me all the steps in detail for optimization.
thanks.
The fastest way to optimize indexes in a table is to use the SQL Server Tuning Advisor. Take a look http://www.youtube.com/watch?v=gjT8wL92mqE <-- here
Select only the columns you need, rather than select *. If your table has some large columns e.g. OLE types or other binary data (maybe used for storing images etc) then you may be transferring vastly more data off disk and over the network than you need.
As others have said, an index is no help to you when you are selecting all rows (no where clause). Using an index would be slower in such cases because of the index read and table lookup for each row, vs full table scan.
If you are running select * from employee (as per question comment) then no amount of indexing will help you. It's an "Every column for every row" query: there is no magic for this.
Adding a WHERE won't help usually for select * query too.
What you can check is index and statistics maintenance. Do you do any?
Or change how you use the data...
Edit:
Why a WHERE clause usually won't help...
If you add a WHERE that is not the PK..
you'll still need to scan the table unless you add an index on the searched column
then you'll need a key/bookmark lookup unless you make it covering
with SELECT * you need to add all columns to the index to make it covering
for many hits, the index will probably be ignored to avoid key/bookmark lookups.
Unless there is a network issue or such, the issue is reading all columns not lack of WHERE
If you did SELECT col13 FROM MyTable and had an index on col13, the index will probably be used.
A SELECT * FROM MyTable WHERE DateCol < '20090101' with an index on DateCol that matched 40% of the table would probably ignore the index, or you'd have expensive key/bookmark lookups
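For instance, a narrow covering index (made-up names) avoids those lookups entirely for a narrow query:
--the key column supports the range filter; INCLUDE carries the selected columns
--in the leaf level, so no key/bookmark lookup is needed
CREATE INDEX IX_Employee_DateCol ON dbo.Employee (DateCol) INCLUDE (FirstName, LastName)

SELECT FirstName, LastName
FROM dbo.Employee
WHERE DateCol < '20090101'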
Irrespective of the merits of returning the whole table to your application, that does sound like an unexpectedly long time to retrieve just 50000 rows of employee data.
Does your query have an ORDER BY or is it literally just select * from employee?
What is the definition of the employee table? Does it contain any particularly wide columns? Are you storing binary data such as their CVs or employee photo in it?
How are you issuing the SQL and retrieving the results?
What isolation level are your select statements running at? (You can use SQL Profiler to check this.)
Are you encountering blocking? Does adding NOLOCK to the query speed things up dramatically?

SQL `LIKE` complexity

Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but might also apply to other DBMS due to the fact that indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be evaluated using indices, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria that reduce the number of rows to a feasible amount, it'll cause a hard performance hit. You might want to consider a full text search solution instead, or some other specialized solution.
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes anymore (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly size of data), indexes and how the LIKE is used (with or without prefix wildcard)!
You are asking too general a question.