Efficient way to Query a Delimited Varchar Field in SQL

I have an SQL Server 2005 table that has a varchar(250) field which contains keywords that I will use for searching purposes. I can't change the design. The data looks like this...
Personal, Property, Cost, Endorsement
What is the most efficient way to run search queries against these keywords? The only thing I can think of is this...
WHERE Keywords LIKE '%endorse%'

Since normalization is not an option, the next best approach is to configure and use Full-Text Search. This maintains an internal search index that makes it easy to search within your data.
The problem with solutions like LIKE '%pattern%' is that they produce a full table scan (or possibly a full index scan), which can take locks on a large amount of the data in your table and slow down any other operations that hit the table.
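As a rough sketch of what that looks like on SQL Server 2005 (the table, key-index, and catalog names here are hypothetical, so substitute your own):
CREATE FULLTEXT CATALOG KeywordCatalog AS DEFAULT;
-- KEY INDEX must name the table's unique key index
CREATE FULLTEXT INDEX ON dbo.MyTable (Keywords)
    KEY INDEX PK_MyTable
    ON KeywordCatalog;
-- Word-level search against the index; the trailing * makes this a prefix
-- term, so it matches 'Endorsement' much like LIKE '%endorse%' would
SELECT *
FROM dbo.MyTable
WHERE CONTAINS(Keywords, '"endorse*"');
One behavioural difference to keep in mind: CONTAINS matches whole words or prefixes, so unlike LIKE '%endorse%' it will not find the pattern in the middle of a word.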

The most efficient way is to normalize your DB design; never store CSV values in a single cell.
Other than using LIKE, you might consider Full-Text Search.

You could use PATINDEX(), which returns the starting position of a pattern within the string (or 0 if it is not found):
USE AdventureWorks;
GO
-- Position of 'ensure' within DocumentSummary, or 0 if absent
SELECT PATINDEX('%ensure%', DocumentSummary)
FROM Production.Document
WHERE DocumentID = 3;
GO
http://msdn.microsoft.com/en-us/library/ms188395%28SQL.90%29.aspx

Related

How can I improve this endless query?

I've got a table with close to 5 million rows. Each of them has a text column where I store my XML logs.
I am trying to find out if there are any logs containing
<node>value</node>
I've tried with
SELECT top 1 id_log FROM Table_Log WHERE log_text LIKE '%<node>value</node>%'
but it never finishes.
Is there any way to improve this search?
PS: I can't drop any of the logs.
A wildcarded query such as '%<node>value</node>%' will result in a full table scan (indexes are ignored) because the engine can't determine where within the field the match will occur. The only real way I know of to improve this query as it stands (short of things like partitioning the table, which should be considered if the table is constantly being logged to) is to add a full-text catalog and index to the table in order to provide a more efficient search over that field.
Here is a good reference that should walk you through it. Once that is done you can use the CONTAINS and FREETEXT predicates, which are optimised for this type of retrieval.
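As a minimal sketch, assuming the full-text index on log_text is already in place: CONTAINS narrows the candidate rows quickly at the word level, and a LIKE over that much smaller set confirms the exact XML placement.
SELECT TOP 1 l.id_log
FROM Table_Log l
WHERE CONTAINS(l.log_text, '"value"')          -- fast word-level filter
  AND l.log_text LIKE '%<node>value</node>%';  -- exact check on the survivors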
Apart from implementing full-text search on that column and indexing the table, maybe you can narrow the results by other parameters (date, etc.).
Also, you could add a varchar column called "Tags" which you populate when inserting a row. This column would hold keywords/tags for the log, and you could then use it as the search condition instead; a sketch of the idea follows.
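A rough sketch of that idea (the column size, index name, and tag format are all assumptions):
-- Narrow, indexable tag column, populated at insert time
ALTER TABLE Table_Log ADD Tags varchar(255) NULL;
CREATE INDEX IX_Table_Log_Tags ON Table_Log (Tags);
-- A leading-anchored LIKE (no leading %) can use the new index
SELECT TOP 1 id_log
FROM Table_Log
WHERE Tags LIKE 'node:value%';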
Unfortunately, about the only way I can see to optimize that is to implement full-text search on that column, but even that will be hard to construct to where it only returns a particular value within a particular element.
I'm currently doing some work where I'm also storing XML within one of the columns. But I'm assuming any queries needed on that data will take a long time, which is okay for our needs.
Another option has to do with storing the data in a binary column, and then SQL Server has options for specifying what type of document is stored in that field. This allows you to, for example, implement more meaningful full-text searching on that field. But it's hard for me to imagine this will efficiently do what you are asking for.
You are using a LIKE query with a leading wildcard.
No index involved = no good.
Unfortunately, there is nothing you can do with what you have currently to speed this up.
I don't think it will help but try using the FAST x query hint like so:
SELECT id_log
FROM Table_Log
WHERE log_text LIKE '%<node>value</node>%'
OPTION(FAST 1)
This should optimise the query to return the first row.

Sql Search in millions of records. Possible?

I have a table in my sql server 2005 database which contains about 50 million records.
I have firstName and LastName columns, and I would like to be able to allow the user to search on these columns without it taking forever.
Apart from indexing these columns, is there a way to make my query work fast?
Also, I want to search for similar-sounding names. For example, if the user searches for Danny, I would like to return records with the names Dan and Daniel as well. It would be nice to show the user a rank, in %, of how close each result is to what was actually searched for.
I know this is a tough task, but I bet I'm not the first one in the world to face this issue :)
Thanks for your help.
We have databases with half a billion records (Oracle, but SQL Server should perform similarly). You can search within a few milliseconds if you have proper indexes. In your case, place an index on firstname and lastname. A B-tree index will perform well and will scale with the size of your database. Be careful: LIKE clauses often break the use of the index and badly degrade performance. I know MySQL can keep using indexes with LIKE clauses when wildcards are only at the right of the string; you would have to check the equivalent behaviour for SQL Server.
String similarity is indeed not simple. Have a look at http://en.wikipedia.org/wiki/Category:String_similarity_measures; you'll see some of the possible algorithms. I cannot say whether SQL Server implements any of them, as I don't know that database. Try Googling "SQL Server" plus the name of an algorithm to maybe find what you need. Otherwise, the wiki provides code in various languages (maybe not SQL, but you should be able to adapt it for a stored procedure).
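For what it's worth, SQL Server does ship with SOUNDEX() and DIFFERENCE() for simple phonetic matching. A minimal sketch (table and column names are hypothetical; persisting the SOUNDEX code in an indexed computed column is what keeps the lookup fast, since calling the function in the WHERE clause would defeat the index):
-- Persist and index the phonetic code (names are assumptions)
ALTER TABLE Person ADD FirstNameSoundex AS SOUNDEX(FirstName) PERSISTED;
CREATE INDEX IX_Person_FirstNameSoundex ON Person (FirstNameSoundex);
-- DIFFERENCE ranges from 0 (no similarity) to 4 (codes match exactly)
SELECT FirstName, LastName, DIFFERENCE(FirstName, 'Danny') AS similarity
FROM Person
WHERE FirstNameSoundex = SOUNDEX('Danny')
ORDER BY similarity DESC;
SOUNDEX is crude compared to the algorithms on that wiki page, but it is built in and cheap.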
Have you tried full-text indexing? I used it on free-text fields in a table with over 1 million records and found it to be pretty fast. Plus you can add synonyms to it, so that Dan, Daniel, and Danny all index as the same (where you get the dictionary of name equivalents is a different story). It allows wildcard searches as well. Full-text indexing can also rank results, though I found that less useful on names (it is better for documents).
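A sketch of the synonym side, assuming you have edited the full-text thesaurus file for your language to declare Danny/Dan/Daniel as an expansion set, and again using hypothetical table and column names:
-- FORMSOF(THESAURUS, ...) expands the search term via the thesaurus file
SELECT FirstName, LastName
FROM Person
WHERE CONTAINS(FirstName, 'FORMSOF(THESAURUS, Danny)');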
Enable Full-Text Search for this table and those columns; that will create a full-text index on those columns.

efficiency of SQL 'LIKE' statement with large number of clauses

I need to extract information from a text field which can contain one of many values. The SQL looks like:
SELECT fieldname
FROM table
WHERE bigtextfield LIKE '%val1%'
OR bigtextfield LIKE '%val2%'
OR bigtextfield LIKE '%val3%'
.
.
.
OR bigtextfield LIKE '%valn%'
My question is: how efficient is this when the number of values approaches the hundreds, and possibly thousands? Is there a better way to do this?
One solution would be to create a new table/column with just the values I'm after and doing the following:
SELECT fieldname
FROM othertable
WHERE value IN ('val1', 'val2', 'val3', ... 'valn')
Which I imagine is a lot more efficient as it only has to do exact string matching. The problem with this is that it will be a lot of work keeping this table up to date.
btw I'm using MS SQL Server 2005.
This functionality is already present in most SQL engines, including MS SQL Server 2005. It's called full-text indexing; here are some resources:
developer.com: Understanding SQL Server Full-Text Indexing
MSDN article: Introduction to Full-Text Search
I don't think the main problem is the number of criteria values - but the sheer fact that a WHERE clause with bigtextfield LIKE '%val1%' can never really be very efficient - even with just a single value.
The trouble is the fact that if you have a placeholder like "%" at the beginning of your search term, all the indices are out the window and cannot be used anymore.
So you're basically just searching each and every entry in your table doing a full table scan in the process. Now your performance basically just depends on the number of rows in your table....
I would support intgr's recommendation - if you need to do this frequently, have a serious look at fulltext indexing.
This will inevitably require a full scan (over the table or over an index) with a filter.
The IN condition won't help here, since it does not work with LIKE.
You could do something like this:
SELECT *
FROM master
WHERE EXISTS
    (
    SELECT NULL
    FROM [values]
    WHERE name LIKE '%' + value + '%'
    )
but this will hardly be more efficient.
All literal conditions will be transformed into a CONSTANT SCAN, which is just like selecting from the same table, but built in memory.
The best solution for this is to redesign: get rid of the field that stores multiple values and make it a related table instead. Storing multiple values in one field violates one of the first rules of database design, and dead-slow queries like this are the reason why. If you can't do that, then full-text indexing is your only hope.
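A rough sketch of that redesign (the child table, key, and index names are hypothetical, and a single-column key on the parent is assumed): one row per value turns the chain of LIKEs into an indexed equality match.
-- One row per value instead of a delimited text field
CREATE TABLE bigtext_values (
    row_id int          NOT NULL,  -- FK to the parent table's key
    value  varchar(255) NOT NULL
);
CREATE INDEX IX_bigtext_values_value ON bigtext_values (value);
-- Exact, index-friendly matching replaces the LIKE chain
SELECT DISTINCT t.fieldname
FROM [table] t
JOIN bigtext_values v ON v.row_id = t.id
WHERE v.value IN ('val1', 'val2', 'val3');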

What is a good way to optimize an Oracle query looking for a substring match?

I have a column in a non-partitioned Oracle table defined as VARCHAR2(50); the column has a standard b-tree index. I was wondering if there is an optimal way to query this column to determine whether it contains a given value. Here is the current query:
SELECT * FROM my_table m WHERE m.my_column LIKE '%'||v_value||'%';
I looked at Oracle Text, but that seems like overkill for such a small column. However, there are millions of records in this table so looking for substring matches is taking more time than I'd like. Is there a better way?
No.
That query is a table scan. If v_value is an actual word, then you may very well want to look at Oracle Text, or at a simple inverted-index scheme you roll on your own. But as it stands, it's horrible.
Oracle Text covers a number of different approaches, not all of them heavyweight. As your column is quite small you could index it with a CTXCAT index.
SELECT * FROM my_table m
WHERE catsearch(m.my_column, v_value, null) > 0
/
Unlike the other types of Text index, CTXCAT indexes are transactional, so they do not require synchronisation. Such indexes consume a lot of space, but that is the price you pay for improved performance.
Find out more.
You have three choices:
live with it;
use something like Oracle Text for full-text searching; or
redefine the problem so you can implement a faster solution.
The simplest way to redefine the problem is to say the column has to start with the search term (so lose the first %), which will then use the index.
An alternative is to say that the search starts on word boundaries (so "est" would match "estimate" but not "test"). MySQL (MyISAM) and SQL Server have functions that will do matching like this. I'm not sure whether Oracle does; if it doesn't, you could create a lookup table of the words to search instead of the column itself, and populate that table from a trigger (a sketch of both ideas follows).
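A minimal sketch of both ideas in Oracle, with hypothetical names for the lookup table and its columns:
-- 1. A leading-anchored LIKE can use the existing B-tree index
SELECT * FROM my_table m WHERE m.my_column LIKE v_value || '%';
-- 2. Word-boundary searching via a lookup table maintained by a trigger
CREATE TABLE my_table_words (
    my_table_id NUMBER       NOT NULL,  -- points back to my_table's key
    word        VARCHAR2(50) NOT NULL
);
CREATE INDEX my_table_words_ix ON my_table_words (word);
SELECT t.*
FROM my_table t
JOIN my_table_words w ON w.my_table_id = t.id
WHERE w.word = v_value;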
You could put a function-based index on the column, using the REGEXP_LIKE function. You might need to create the FBI with a CASE statement that returns 1 on a match, as boolean-returning functions don't seem to be valid in function-based indexes.
Here is an example.
Create the index:
CREATE INDEX regexp_like_on_myCol ON my_table (
    CASE WHEN REGEXP_LIKE(my_column, '[static exp]', 'i')
         THEN 1
    END);
And then to use it, instead of:
SELECT * FROM my_table m WHERE m.my_column LIKE '%'||v_value||'%';
you will need to perform a query like the following:
SELECT * FROM my_table m WHERE (
    CASE WHEN REGEXP_LIKE(m.my_column, '[static exp]', 'i')
         THEN 1
    END) IS NOT NULL;
A significant shortcoming of this approach is that you need to know your '[static exp]' at the time you create the index. If you are looking for a performance increase on ad hoc queries, this might not be the solution for you.
A bonus though, as the function name indicates, is that you have the opportunity to create this index using regex, which could be a powerful tool in the end. The evaluation hit will be taken when items are added to the table, not during the search.
You could try INSTR:
...WHERE INSTR(m.my_column, v_value) > 0
I don't have access to Oracle to test & find out if it is faster than LIKE with wildcarding.
For the most generic case, where you do not know in advance the string you are searching for, the best access path you can hope for is a fast full index scan. You'd have to focus on keeping the index as small as possible, which might have its own problems of course, and you could look at a compressed index if the data is not of very high cardinality.

SQL full text search vs "LIKE"

Let's say I have a fairly simple app that lets users store information on DVDs they own (title, actors, year, description, etc.) and I want to allow users to search their collection by any of these fields (e.g. "Keanu Reeves" or "The Matrix" would be valid search queries).
What's the advantage of going with SQL full text search vs simply splitting the query up by spaces and doing a few "LIKE" clauses in the SQL statement? Does it simply perform better or will it actually return results that are more accurate?
Full-text search is likely to be quicker since it benefits from an index of words that it uses to look up the matching records, whereas LIKE needs a full table scan.
In some cases LIKE will be more accurate: LIKE '%The%' AND LIKE '%Matrix' will pick out "The Matrix" but not "Matrix Reloaded", whereas full-text search will ignore "The" and return both. That said, returning both would likely have been the better result.
Full-text indexes (which are indexes) are much faster than using LIKE (which essentially examines each row every time). However, if you know the database will be small, there may not be a performance need to use full-text indexes. The only way to determine this is with some intelligent averaging and some testing based on that information.
Accuracy is a different question. Full-text indexing allows you to do several things (weighting, automatically matching eat/eats/eating, etc.) that you couldn't possibly implement in any sort of reasonable time-frame using LIKE. The real question is whether you need those features.
Without reading the full-text documentation's description of these features, you're really not going to know how you should proceed. So, read up!
Also, some basic tests (insert a bunch of rows in a table, maybe with some sort of public dictionary as a source of words) will go a long way to helping you decide.
A full-text search query is much faster, especially when working with lots of data in various columns.
Additionally you get language-specific search support. E.g. German umlauts like "ü" in "über" will also be found when stored as "ueber". You can also use synonyms to automatically expand search queries, or to replace or substitute specific phrases.
"In some cases LIKE will be more accurate: LIKE '%The%' AND LIKE '%Matrix' will pick out 'The Matrix' but not 'Matrix Reloaded', whereas full-text search will ignore 'The' and return both. That said, returning both would likely have been the better result."
That is not correct. The full-text search syntax lets you specify "how" you want to search. E.g. by using the CONTAINS statement you can use exact term matching as well as fuzzy matching, weights, etc.
So if you have performance issues or would like to provide a more "Google-like" search experience, go for the full text search engine. It is also very easy to configure.
Just a few notes:
LIKE can use an Index Seek if your pattern doesn't start with %. Example: LIKE 'Santa M%' is good; LIKE '%Maria' is bad, and can cause a Table or Index Scan because such a pattern can't be indexed in the standard way.
This is very important: Full-Text Index updates are asynchronous. For instance, if you perform an INSERT on a table followed by a SELECT with Full-Text Search where you expect the new data to appear, you might not get the data immediately. Depending on your configuration, you may have to wait a few seconds or a day. Generally, Full-Text Indexes are populated when your system does not have many requests.
It will perform better, but unless you have a lot of data you won't notice the difference. A SQL full-text search index lets you use operators that are more advanced than a simple LIKE operation, but if all you do is the equivalent of a LIKE against your full-text index, your results will be the same.
Imagine that you allow users to enter notes/descriptions on their DVDs. In that case it would be good to allow searching by description, and full-text search will do the better job.
You may get slightly better results, or at least have an easier implementation, with full-text indexing. But it depends on how you want it to work...
What I have in mind is that if you are searching for two words, with LIKE you have to manually implement (for example) a method to weight records containing both words higher in the list. A full-text index will do this for you, and it allows you to influence the weightings too using the relevant syntax; see the sketch below.
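A minimal sketch of that weighting idea on SQL Server, using CONTAINSTABLE with ISABOUT weights (the table, column, and key names are hypothetical):
-- RANK reflects the weighted relevance of each matching row
SELECT d.Title, ft.[RANK]
FROM CONTAINSTABLE(Dvd, Title,
        'ISABOUT (matrix WEIGHT(0.8), reloaded WEIGHT(0.2))') ft
JOIN Dvd d ON d.DvdId = ft.[KEY]
ORDER BY ft.[RANK] DESC;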
To make Full-Text Search in SQL Server behave like LIKE:
First, you have to create a stoplist and assign it to your table:
CREATE FULLTEXT STOPLIST [MyStopList];
GO
ALTER FULLTEXT INDEX ON dbo.[MyTableName] SET STOPLIST [MyStopList]
GO
Second, use the following T-SQL script:
SELECT * FROM dbo.[MyTableName] AS mt
WHERE CONTAINS((mt.ColumnName1,mt.ColumnName2,mt.ColumnName3), N'"*search text s*"')
If you are not just searching for English words, say you search for a Chinese word, then how your FTS tokenizes words makes a big difference to your search, as in the example I gave here: https://stackoverflow.com/a/31396975/301513. But I don't know how SQL Server tokenizes Chinese words; does it do a good job of that?