I need to extract information from a text field which can contain one of many values. The SQL looks like:
SELECT fieldname
FROM table
WHERE bigtextfield LIKE '%val1%'
OR bigtextfield LIKE '%val2%'
OR bigtextfield LIKE '%val3%'
.
.
.
OR bigtextfield LIKE '%valn%'
My question is: how efficient is this when the number of values approaches the hundreds, and possibly thousands? Is there a better way to do this?
One solution would be to create a new table/column with just the values I'm after and doing the following:
SELECT fieldname
FROM othertable
WHERE value IN ('val1', 'val2', 'val3', ... 'valn')
I imagine this is a lot more efficient, as it only has to do exact string matching. The problem is that it will be a lot of work to keep this table up to date.
btw I'm using MS SQL Server 2005.
This functionality is already present in most SQL engines, including MS SQL Server 2005. It's called full-text indexing; here are some resources:
developer.com: Understanding SQL Server Full-Text Indexing
MSDN article: Introduction to Full-Text Search
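As a rough sketch of the setup and query on SQL Server 2005 (the catalog and index names here are hypothetical, the table needs a single-column unique index for the KEY INDEX clause, and note that full-text matching is word-based rather than arbitrary-substring):
CREATE FULLTEXT CATALOG ft_catalog;
CREATE FULLTEXT INDEX ON bigtable (bigtextfield) KEY INDEX PK_bigtable ON ft_catalog;
-- The search then becomes:
SELECT fieldname
FROM bigtable
WHERE CONTAINS(bigtextfield, '"val1" OR "val2" OR "val3"');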
I don't think the main problem is the number of criteria values - but the sheer fact that a WHERE clause with bigtextfield LIKE '%val1%' can never really be very efficient - even with just a single value.
The trouble is that a wildcard like "%" at the beginning of your search term throws all the indexes out the window; they cannot be used anymore.
So you're examining each and every entry in your table, doing a full table scan in the process, and your performance then depends entirely on the number of rows in your table.
I would support intgr's recommendation - if you need to do this frequently, have a serious look at fulltext indexing.
This will inevitably require a full scan (over the table or over an index) with a filter.
An IN condition won't help here, since IN does exact matching and cannot express a LIKE pattern.
You could do something like this:
SELECT *
FROM master
WHERE EXISTS
    (
    SELECT NULL
    FROM [values]
    WHERE master.name LIKE '%' + [values].value + '%'
    )
but this will hardly be more efficient: all literal conditions will be transformed into a CONSTANT SCAN, which is just like selecting from the same table, but built in memory.
The best solution is to redesign: get rid of the field that stores multiple values and move those values into a related table instead. Storing multiple values in one field violates one of the first rules of database design, and dead-slow queries like this are the reason why. If you can't do that, then full-text indexing is your only hope.
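A minimal sketch of what that related table could look like (all names here are hypothetical):
CREATE TABLE field_values (
    record_id INT NOT NULL,          -- FK back to the original table
    value     VARCHAR(100) NOT NULL  -- one value per row instead of a blob of text
);
CREATE INDEX ix_field_values_value ON field_values (value);
-- The search then becomes an indexable lookup instead of a scan:
SELECT record_id
FROM field_values
WHERE value IN ('val1', 'val2', 'val3');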
I've got a table with close to 5 million rows. Each row has a text column where I store my XML logs.
I am trying to find out if any log contains
<node>value</node>
I've tried with
SELECT top 1 id_log FROM Table_Log WHERE log_text LIKE '%<node>value</node>%'
but it never finishes.
Is there any way to improve this search?
PS: I can't drop any log
A wildcarded query such as '%<node>value</node>%' will result in a full table scan (ignoring indexes), as the engine can't determine where within the field the match will occur. The only real way I know of to improve this query as it stands (short of things like partitioning the table, which should be considered if the table is logging constantly) would be to add a full-text catalog and index to the table in order to provide a more efficient search over that field.
Here is a good reference that should walk you through it. Once this has been completed you can use things like the CONTAINS and FREETEXT operators that are optimised for this type of retrieval.
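As a rough sketch, once the catalog and index exist, the query could become something like the following (keeping in mind that full-text search is word-based, so you would search for the word rather than the exact XML fragment):
SELECT TOP 1 id_log
FROM Table_Log
WHERE CONTAINS(log_text, '"value"');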
Apart from implementing full-text search on that column and indexing the table, maybe you can narrow the results by other parameters (date, etc.).
Also, you could add a table field (varchar type) called "Tags", which you populate when inserting a row. This field would hold keywords or tags for the log, and you could then add it as a condition to your query.
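A minimal sketch of that idea (the column name and tag value are hypothetical):
ALTER TABLE Table_Log ADD Tags VARCHAR(200) NULL;
CREATE INDEX ix_Table_Log_Tags ON Table_Log (Tags);
-- Filter on the indexed Tags column first, so the expensive LIKE
-- only runs against the rows that survive:
SELECT TOP 1 id_log
FROM Table_Log
WHERE Tags = 'node-value'
  AND log_text LIKE '%<node>value</node>%';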
Unfortunately, about the only way I can see to optimize that is to implement full-text search on that column, but even that will be hard to construct to where it only returns a particular value within a particular element.
I'm currently doing some work where I'm also storing XML within one of the columns. But I'm assuming any queries needed on that data will take a long time, which is okay for our needs.
Another option has to do with storing the data in a binary column, and then SQL Server has options for specifying what type of document is stored in that field. This allows you to, for example, implement more meaningful full-text searching on that field. But it's hard for me to imagine this will efficiently do what you are asking for.
You are using a LIKE query with a leading wildcard.
No index involved = no good.
Unfortunately, there is nothing you can do with what you have currently to speed this up.
I don't think it will help but try using the FAST x query hint like so:
SELECT id_log
FROM Table_Log
WHERE log_text LIKE '%<node>value</node>%'
OPTION(FAST 1)
This tells the optimiser to favour returning the first row as quickly as possible.
Let's say TableA has a field called FieldA. FieldA consists of a list of values with a comma as the delimiter (i.e. 1,2,3,4). #MyVar# contains the value (a single value, NOT a list) I want to search for. Other than doing the following,
SELECT *
FROM TableA
WHERE FieldA LIKE '%,#MyVar#,%'
OR FieldA LIKE '#MyVar#,%'
OR FieldA LIKE '%,#MyVar#'
Is there a SQL command/reserved word to perform the equivalent of the SQL statement above?
You have a badly designed, denormalized database. If at all possible, fix the design.
If that's not possible then consider the nature of the MyVar values. If they cannot possibly nest inside each other (for instance, they're all guaranteed to be the same number of characters) you can get away with a single like. If not, your solution is about as good as you'll get.
As per Pounding A Nail: Old Shoe or Glass Bottle?:
This is quite possibly the worst way you could store this data in a database. No, seriously. They had better ways of doing this over thirty years ago. You have created a horrible problem that is only starting to rear its ugly head. It will only continue to get horribly worse, costing your company more and more money to maintain.
The answer is to make a new table, with a one-to-many relationship between the main ID and the different values. Then it's a much simpler query:
SELECT MainID
FROM NewTable
WHERE FieldA = #MyVar#
This is a much easier solution to code against, not to mention understand. And it will help significantly when you have to join this table to others.
EDIT: From your responses, I understand that doing this would be a serious amount of work. However, you're already running into the issues that are arising from the current setup. Doing this will take time up front, but it will make future development significantly faster in the long run.
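For illustration, NewTable could be shaped something like this (a sketch; anything beyond the names in the query above is an assumption):
CREATE TABLE NewTable (
    MainID INT NOT NULL,             -- FK to the original table's ID
    FieldA VARCHAR(50) NOT NULL      -- one value per row instead of a CSV
);
CREATE INDEX ix_NewTable_FieldA ON NewTable (FieldA);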
There isn't any need to include all those ORs or the commas. This query should do what you need.
SELECT *
FROM TableA
WHERE FieldA LIKE '%#MyVar#%'
If the variable appears anywhere in the field, the row will be returned. You don't have to try to compare it to each of your "list" items in each field.
You're kind of abusing database design by including a list of values in a single field. There isn't any built-in SQL function that deals with this kind of data, that I know of.
Wrap the field in delimiters on the left side of the LIKE comparison, and include the delimiters around the search value on the right side. Like this:
SELECT *
FROM TableA
WHERE ',' + FieldA + ',' LIKE '%,#MyVar#,%'
This actually should do it in MySQL... correct or am I missing something?
SELECT * FROM TableA WHERE FIND_IN_SET('1', field_with_a_list_of_numbers)
You can use IN to check a list:
SELECT * FROM TableA WHERE FieldA IN ('%,#MyVar#,%')
No, in a word. In a normalized database you would usually split these list items into a separate table, but there are times where this is either not possible (the user has no ability to alter the database structure) or undesired (budget limitations).
You can either:
do something like durilai suggests (not performant for large amounts of data, but probably the easiest to implement); or
use a trigger to split the entries on arrival (i.e. on INSERT) into a subtable (SubTableA) and then query it like this (a sketch of such a trigger follows below):
select * from TableA a
where exists (select 1 from SubTableA s
              where s.a_id = a.id
              and s.list_element like '%#MyVar#%')
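Here is a rough sketch of such a trigger for SQL Server (the table and column names are assumptions; it does a naive comma split, which is fine for short lists):
CREATE TRIGGER trg_TableA_split ON TableA
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @id INT, @list VARCHAR(250), @pos INT;
    -- Walk every freshly inserted row.
    DECLARE c CURSOR LOCAL FOR SELECT id, FieldA FROM inserted;
    OPEN c;
    FETCH NEXT FROM c INTO @id, @list;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Peel one comma-separated element off the front at a time.
        WHILE LEN(@list) > 0
        BEGIN
            SET @pos = CHARINDEX(',', @list);
            IF @pos = 0
            BEGIN
                INSERT INTO SubTableA (a_id, list_element)
                VALUES (@id, LTRIM(RTRIM(@list)));
                SET @list = '';
            END
            ELSE
            BEGIN
                INSERT INTO SubTableA (a_id, list_element)
                VALUES (@id, LTRIM(RTRIM(LEFT(@list, @pos - 1))));
                SET @list = SUBSTRING(@list, @pos + 1, LEN(@list));
            END
        END
        FETCH NEXT FROM c INTO @id, @list;
    END
    CLOSE c;
    DEALLOCATE c;
END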
I have an SQL Server 2005 table that has a varchar(250) field which contains keywords that I will use for searching purposes. I can't change the design. The data looks like this...
Personal, Property, Cost, Endorsement
What is the most efficient way to run search queries against these keywords? The only thing I can think of is this...
WHERE Keywords LIKE '%endorse%'
Since normalization is not an option, the next best option is going to be to configure and use Full Text Search. This will maintain an internal search index that will make it very easy for you to search within your data.
The problem with solutions like LIKE '%pattern%' is that this will produce a full table scan (or maybe a full index scan) that could produce locks on a large amount of the data in your table, which will slow down any operations that hit the table in question.
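For illustration, a sketch of the setup and a prefix search (the names are hypothetical, the table needs a single-column unique index for the KEY INDEX clause, and note that CONTAINS matches at word boundaries, so "endorse*" finds words starting with "endorse" rather than substrings in the middle of a word):
CREATE FULLTEXT CATALOG KeywordCatalog;
CREATE FULLTEXT INDEX ON MyTable (Keywords) KEY INDEX PK_MyTable ON KeywordCatalog;
SELECT *
FROM MyTable
WHERE CONTAINS(Keywords, '"endorse*"');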
The most efficient way is to normalize your DB design: never store CSV values in a single cell.
Other than using LIKE, you might consider full-text search.
You could use PATINDEX(), which returns the starting position of a pattern within the string (0 if it isn't found):
USE AdventureWorks;
GO
SELECT PATINDEX('%ensure%',DocumentSummary)
FROM Production.Document
WHERE DocumentID = 3;
GO
http://msdn.microsoft.com/en-us/library/ms188395%28SQL.90%29.aspx
Does anyone know what the complexity is for the SQL LIKE operator for the most popular databases?
Let's consider the three core cases separately. This discussion is MySQL-specific, but the reasoning may also apply to other DBMSs, since indexes are typically implemented in a similar manner.
LIKE 'foo%' is quick if run on an indexed column. MySQL indexes are a variation of B-trees, so when performing this query it can simply descend the tree to the node corresponding to foo, or the first node with that prefix, and traverse the tree forward. All of this is very efficient.
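For illustration (with hypothetical names), the difference looks like this:
CREATE INDEX idx_my_table_name ON my_table (name);
SELECT * FROM my_table WHERE name LIKE 'foo%'; -- range scan on idx_my_table_name
SELECT * FROM my_table WHERE name LIKE '%foo'; -- index unusable, full scan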
LIKE '%foo' can't be accelerated by indexes and will result in a full table scan. If you have other criteria that can be executed using indexes, it will only scan the rows that remain after the initial filtering.
There's a trick though: If you need to do suffix matching - searching for file names with extension .foo, for instance - you can achieve the same performance by adding a column with the same contents as the original one but with the characters in reverse order.
ALTER TABLE my_table ADD COLUMN col_reverse VARCHAR (256) NOT NULL;
ALTER TABLE my_table ADD INDEX idx_col_reverse (col_reverse);
UPDATE my_table SET col_reverse = REVERSE(col);
Searching for rows with col ending in .foo then becomes:
SELECT * FROM my_table WHERE col_reverse LIKE 'oof.%'
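One caveat: col_reverse has to be kept in sync with col on future writes, for example with triggers (a MySQL sketch, reusing the names above):
CREATE TRIGGER trg_my_table_rev_ins BEFORE INSERT ON my_table
FOR EACH ROW SET NEW.col_reverse = REVERSE(NEW.col);
CREATE TRIGGER trg_my_table_rev_upd BEFORE UPDATE ON my_table
FOR EACH ROW SET NEW.col_reverse = REVERSE(NEW.col);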
Finally, there's LIKE '%foo%', for which there are no shortcuts. If there are no other limiting criteria that reduce the number of rows to a feasible amount, it'll cause a hard performance hit. You might want to consider a full-text search solution instead, or some other specialized solution.
If you are asking about the performance impact:
The problem with LIKE is that it can keep the database from using an index. On Oracle I think it doesn't use indexes for these patterns at all (but I'm still on Oracle 9). SQL Server uses indexes if the wildcard is only at the end. I don't know about other databases.
Depends on the RDBMS, the data (and possibly size of data), indexes and how the LIKE is used (with or without prefix wildcard)!
You are asking too general a question.
I have a column in a non-partitioned Oracle table defined as VARCHAR2(50); the column has a standard b-tree index. I was wondering if there is an optimal way to query this column to determine whether it contains a given value. Here is the current query:
SELECT * FROM my_table m WHERE m.my_column LIKE '%'||v_value||'%';
I looked at Oracle Text, but that seems like overkill for such a small column. However, there are millions of records in this table so looking for substring matches is taking more time than I'd like. Is there a better way?
No.
That query is a table scan. If v_value is an actual word, then you may very well want to look at Oracle Text or a simple inverted index scheme you roll on your own. But as it is, it's horrible.
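For a flavour of the roll-your-own variant, a sketch (all names are hypothetical; note that stored ROWIDs can go stale if rows move, e.g. after a table reorganisation):
CREATE TABLE my_table_words (
    word   VARCHAR2(50) NOT NULL,
    row_id ROWID NOT NULL
);
CREATE INDEX my_table_words_ix ON my_table_words (word);
-- Populate by splitting my_column into words (application code or PL/SQL),
-- then search with an indexable equality instead of a wildcarded LIKE:
SELECT m.*
FROM my_table m
WHERE m.ROWID IN (SELECT w.row_id FROM my_table_words w WHERE w.word = v_value);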
Oracle Text covers a number of different approaches, not all of them heavyweight. As your column is quite small you could index it with a CTXCAT index.
SELECT * FROM my_table m
WHERE catsearch(m.my_column, v_value, null) > 0
/
Unlike the other types of Text index, CTXCAT indexes are transactional, so they do not require synchronisation. Such indexes consume a lot of space, but that is the price you pay for improved performance.
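For completeness, a sketch of creating such an index (the index name is hypothetical):
CREATE INDEX my_column_ctxcat ON my_table (my_column)
INDEXTYPE IS CTXSYS.CTXCAT;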
Find out more.
You have three choices:
live with it;
use something like Oracle Text for full-text searching; or
redefine the problem so you can implement a faster solution.
The simplest way to redefine the problem is to say the column has to start with the search term (so lose the first %), which will then use the index.
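A minimal sketch of that prefix-only form, which can use the existing b-tree index:
SELECT * FROM my_table m WHERE m.my_column LIKE v_value || '%';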
An alternative way is to say that the search starts on word boundaries (so "est" will match "estimate" but not "test"). MySQL (MyISAM) and SQL Server have functions that will do matching like this. Not sure if Oracle does. If it doesn't, you could create a lookup table of words to search instead of the column itself, and populate that table with a trigger.
You could put a function-based index (FBI) on the column, using the REGEXP_LIKE function. You might need to create the FBI with a CASE statement that returns 1 on a match, as boolean-returning functions don't seem to be valid in an FBI.
Here is an example.
Create the index:
CREATE INDEX regexp_like_on_myCol ON my_table (
CASE WHEN REGEXP_LIKE(my_column, '[static exp]', 'i')
THEN 1
END);
And then to use it, instead of:
SELECT * FROM my_table m WHERE m.my_column LIKE '%'||v_value||'%';
you will need to perform a query like the following:
SELECT * FROM my_table m WHERE (
CASE WHEN REGEXP_LIKE(m.my_column, '[static exp]', 'i')
THEN 1
END) IS NOT NULL;
A significant shortcoming of this approach is that you will need to know your '[static exp]' at the time you create your index. If you are looking for a performance increase while performing ad hoc queries, this might not be the solution for you.
A bonus though, as the function name indicates, is that you have the opportunity to create this index using regex, which could be a powerful tool in the end. The evaluation hit will be taken when items are added to the table, not during the search.
You could try INSTR:
...WHERE INSTR(m.my_column, v_value) > 0
I don't have access to Oracle to test & find out if it is faster than LIKE with wildcarding.
For the most generic case, where you do not know in advance the string you are searching for, the best access path you can hope for is a fast full index scan. You'd have to focus on keeping the index as small as possible, which might have its own problems of course, and you could look at a compressed index if the data is not of very high cardinality.