So apparently, if a MySQL table's fulltext index contains a keyword that appears in at least 50% of the data rows, that keyword will be ignored by the MATCH query.
So if I have a table with a FULLTEXT index on the 'content' column, the table contains 50 rows,
and 27 of the rows contain the word 'computer' in the content field, and I run the query:
SELECT *
FROM `table`
WHERE MATCH(`content`) AGAINST ('computer');
...the query returns zero results, since 'computer' appears in more than 50% of the rows and the keyword is therefore ignored.
Is there a way to disable this behaviour? It is especially problematic in the early phase of the database's lifespan.
Use BOOLEAN full-text searches to bypass the 50% threshold.
http://dev.mysql.com/doc/refman/5.0/en/fulltext-boolean.html
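For the query above, that would look something like this (the IN BOOLEAN MODE modifier is what disables the 50% threshold):

SELECT *
FROM `table`
WHERE MATCH(`content`) AGAINST ('computer' IN BOOLEAN MODE);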
Yes, use the Boolean Full-Text Search option.
See here: MySQL Manual
The manual says that "They do not use the 50% threshold."
Full-text search in MySQL is not intuitive and the manual is slightly confusing, so you need to read it carefully and prepare some test cases to make sure it is working and that you have implemented it correctly (from bitter experience).
There are a number of other search plugins of varying complexity that perform better and are worth looking at, but it is worth getting to grips with the native version first, as it is quick and (easy?) dirty.
In SQL Server, I have a table containing 46 million rows.
I want to search in the "Title" column of the table. The search word may appear at any position within the field value.
For example:
Value in table: BROTHERS COMPANY
Search string: ROTHER
I want this search to match the given record. This is exactly what LIKE '%ROTHER%' does. However, LIKE with a leading wildcard should not be used on large tables because of performance issues. How can I achieve this?
Though I don't know your requirements, your best approach may be to challenge them. Middle-of-the-string searches are usually not very practical. If you can get your users to perform prefix searches (broth%) then you can easily use Full Text's wildcard search (CONTAINS(*, '"broth*"')). Full Text can also handle suffix searches (%rothers) with a little extra work.
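As a rough sketch of those two approaches (the table and index names here are made up, a full-text index on Title is assumed, and the persisted computed column requires SQL Server 2005 or later):

-- Prefix search via the full-text wildcard
SELECT *
FROM Companies
WHERE CONTAINS(Title, '"broth*"');

-- Suffix search: keep a persisted, indexed, reversed copy of the column and prefix-search it
ALTER TABLE Companies ADD TitleReversed AS REVERSE(Title) PERSISTED;
CREATE INDEX IX_Companies_TitleReversed ON Companies (TitleReversed);

SELECT *
FROM Companies
WHERE TitleReversed LIKE REVERSE('%ROTHERS');  -- REVERSE() turns the pattern into the sargable prefix 'SREHTOR%'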
But when it comes to middle-of-the-string searches with SQL Server, you're stuck using LIKE. However, you may be able to improve the performance of LIKE by using a binary collation as explained in this article. (I hate to post a link without including its content, but it is far too long an article to post here and I don't understand the approach well enough to sum it up.)
If that doesn't help and if middle-of-the-string searches are that important of a requirement then you should consider using a different search solution like Lucene.
Add a full-text index if you want.
You can search the table using CONTAINS:
SELECT *
FROM YourTable
WHERE CONTAINS(TableColumnName, 'SearchItem')
I have a query which slows down immensely when I add an additional WHERE part,
which is essentially just a LIKE lookup on a varchar(500) field:
where...
and (xxxxx.yyyy like '% blahblah %')
I've been racking my brain, but the query slows down terribly whenever I add this in.
I'm wondering if anyone has suggestions in terms of changing the field type, index setup, or index hints, or anything else that might assist.
Any help appreciated.
SQL Server 2000 Enterprise.
HERE IS SOME ADDITIONAL INFO:
Oops, some background: unfortunately I do need (in the case of the LIKE statement) to have the % at the front.
There is business logic behind that which I can't avoid.
I have since created a full-text catalogue on the field that is causing me problems
and converted the search to use the CONTAINS syntax.
Unfortunately, although this has increased performance, on occasion it appears to be slow (slower) for new word searches.
So if I search for apple, apple appears to be faster on subsequent searches, but not for a new search such as orange (for example).
So I don't think I can go with that (unless you can suggest some tinkering to make it more consistent).
Additional info:
the table contains only around 60k records
the field I'm trying to filter is a varchar(500)
SQL Server 2000 on Windows Server 2003
The query I'm using is definitely convoluted.
Sorry, I've had to replace the proprietary stuff, but this should give you an indication of the query:
SELECT TOP 99 AAAAAAAA.Item_ID, AAAAAAAA.CatID, AAAAAAAA.PID, AAAAAAAA.Description,
AAAAAAAA.Retail, AAAAAAAA.Pack, AAAAAAAA.CatID, AAAAAAAA.Code, BBBBBBBB.blahblah_PictureFile AS PictureFile,
AAAAAAAA.CL1, AAAAAAAA.CL1, AAAAAAAA.CL2, AAAAAAAA.CL3
FROM CCCCCCC INNER JOIN DDDDDDDD ON CCCCCCC.CID = DDDDDDDD.CID
INNER JOIN AAAAAAAA ON DDDDDDDD.CID = AAAAAAAA.CatID LEFT OUTER JOIN BBBBBBBB
ON AAAAAAAA.PID = BBBBBBBB.Product_ID INNER JOIN EEEEEEE ON AAAAAAAA.BID = EEEEEEE.ID
WHERE
(CCCCCCC.TID = 654321) AND (DDDDDDDD.In_Use = 1) AND (AAAAAAAA.Unused = 0)
AND (DDDDDDDD.Expiry > '10-11-2010 09:23:38') AND
(
(AAAAAAAA.Code = 'red pen') OR
(
(my_search_description LIKE '% red %') AND (my_search_description LIKE '% nose %')
AND (DDDDDDDD.CID IN (63,153,165,305,32,33))
)
)
AND (DDDDDDDD.CID IN (20,32,33,63,64,65,153,165,232,277,294,297,300,304,305,313,348,443,445,446,447,454,472,479,481,486,489,498))
ORDER BY AAAAAAAA.f_search_priority DESC, DDDDDDDD.Priority DESC, AAAAAAAA.Description ASC
You can see that throwing in the my_search_description filter also includes a DDDDDDDD.CID filter (business logic).
This is the part which is slowing things down (from a 1.5-2 second page load to a 6-8 second load (ow ow ow)).
It might be my lack of understanding of how to get the full-text search catalogue working.
I'm very impressed by the answers, so if anyone has any tips I'd be most grateful.
If you haven't already, enable full text indexing.
Unfortunately, using the LIKE clause on a query really does slow things down. Full Text Indexing is really the only way that I know of to speed things up (at the cost of storage space, of course).
Here's a link to an overview of Full-Text Search in SQL Server which will show you how to configure things and change your queries to take advantage of the full-text indexes.
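On SQL Server 2000 the initial setup goes through system stored procedures rather than DDL; a rough sketch using the column name from your query (AAAAAAAA is assumed to be the table holding my_search_description, and the catalog and unique index names are placeholders):

-- Enable full-text for the database, create a catalog, register the table
-- via its unique index, add the column, then start a full population
EXEC sp_fulltext_database 'enable'
EXEC sp_fulltext_catalog 'SearchCatalog', 'create'
EXEC sp_fulltext_table 'AAAAAAAA', 'create', 'SearchCatalog', 'PK_AAAAAAAA'
EXEC sp_fulltext_column 'AAAAAAAA', 'my_search_description', 'add'
EXEC sp_fulltext_table 'AAAAAAAA', 'start_full'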
More details would certainly help, but...
Full-text indexing can certainly be useful (depending on the more details about the table and your query). Full Text indexing requires a good bit of extra work both in setup and querying, but it's the only way to try to do the sort of search you seek efficiently.
The problem with LIKE that starts with a Wildcard is that SQL server has to do a complete table scan to find matching records - not only does it have to scan every row, but it has to read the contents of the char-based field you are querying.
With or without a full-text index, one thing can possibly help: Can you narrow the range of rows being searched, so at least SQL doesn't need to scan the whole table, but just some subset of it?
The '% blahblah %' is a problem for improving performance. Putting the wildcard at the beginning tells SQL Server that the string can begin with any legal character, so it must scan the entire index. Your best bet if you must have this filter is to focus on your other filters for improvement.
Using LIKE with a wildcard at the beginning of the search pattern forces the server to scan every row. It's unable to use any indexes. Indexes work from left to right, and since there is no constant on the left, no index is used.
From your WHERE clause, it looks like you're trying to find rows where a specific word exists in an entry. If you're searching for a whole word, then full text indexing may be a solution for you.
Full text indexing creates an index entry for each word that's contained in the specified column. You can then quickly find rows that contain a specific word.
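Applied to the query in the question, the two LIKE filters would become something roughly like this (assuming my_search_description has been added to the full-text catalogue):

-- whole-word search on both terms instead of LIKE '% red %' AND LIKE '% nose %'
SELECT Item_ID, Description
FROM AAAAAAAA
WHERE CONTAINS(my_search_description, '"red" AND "nose"')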
As other posters have correctly pointed out, the use of the wildcard character % within the LIKE expression is resulting in a query plan being produced that uses a SCAN operation. A scan operation touches every row in the table or index, depending on the type of scan operation being performed.
So the question really then becomes, do you actually need to search for the given text string anywhere within the column in question?
If not, great, problem solved; but if it is essential to your business logic, then you have two routes of optimization:
1) Really go to town on increasing the overall selectivity of your query by focusing your optimization efforts on the remaining search arguments.
2) Implement a full-text indexing solution.
I don't think this is a valid answer, but I'd like to throw it out there for some more experienced posters' comments... are these equivalent?
where (xxxxx.yyyy like '% blahblah %')
vs
where patindex('%blahblah%', xxxxx.yyyy) > 0
As far as I know, that's equivalent from a database logic standpoint, as it forces the same scan. Guess it couldn't hurt to try?
I need to make a FuzzyQuery using an index that contains around 8 million lines. That kind of query is pretty slow, needing about 20 seconds for every match. The fact is that I can narrow down the results using another field to about 5000 hits before doing the fuzzy search. For this to work, I should be able to make a search by the "narrower" field first, and then use the fuzzy search within those results.
According to the lucene FAQ, the only thing I have to do is a BooleanQuery, where the "narrower" should be required (BooleanClause.Occur.MUST in lucene 3).
Now I have tried two different approaches:
a) Using the Query Parser, with an input like:
narrower:+narrowing_text fuzzy:fuzzy_text~0.9
b) Constructing a BooleanQuery with a TermQuery and a FuzzyQuery
Neither worked; I'm getting about the same times as when the narrower is not used.
Also, just to check that the times would be much better if the narrower were working, I reindexed only the 5000 items that match the narrower, and the search went fast as hell.
In case anyone wonders, I'm using pylucene 3.0.2.
Doppleganger, you can probably use a Filter, specifically a QueryWrapperFilter.
Follow the example from Lucene in Action. You may have to make some modifications for use in python, but otherwise it should be simple:
Create the query that narrows this down to 5000 hits.
Use it to build a QueryWrapperFilter.
Use the filter in a search involving the fuzzy query.
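A rough pylucene 3.x sketch of those three steps (the field names narrower and fuzzy are taken from the question; the index path, field values and result handling are placeholders):

import lucene
lucene.initVM()
from lucene import (SimpleFSDirectory, File, IndexSearcher, Term,
                    TermQuery, FuzzyQuery, QueryWrapperFilter)

# open the existing index read-only
searcher = IndexSearcher(SimpleFSDirectory(File("/path/to/index")), True)

# 1. query that narrows the candidate set down to ~5000 documents
narrower_query = TermQuery(Term("narrower", "narrowing_text"))

# 2. wrap it in a filter
narrower_filter = QueryWrapperFilter(narrower_query)

# 3. run the fuzzy query only against the filtered documents
fuzzy_query = FuzzyQuery(Term("fuzzy", "fuzzy_text"), 0.9)
hits = searcher.search(fuzzy_query, narrower_filter, 50)

for score_doc in hits.scoreDocs:
    print(searcher.doc(score_doc.doc).get("fuzzy"))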
I have a table Books in my MySQL database which has the columns Title (varchar(255)) and Edition (varchar(20)). Example values for these are "Introduction to Microeconomics" and "4".
I want to let users search for Books based on Title and Edition. So, for example they could enter "Microeconomics 4" and it would get the proper result. My question is how I should set this up on the database side.
I've been told that FULLTEXT search is generally a good way to do things like this. However, because the edition is sometimes just a single character ("4"), the full-text search would have to be set up to index single-character words (ft_min_word_len = 1). This, I've heard, is very inefficient.
So, how should I setup searches of this database?
UPDATE: I'm aware that CONCAT/LIKE could be used here. My question is whether it would be too slow. My Books table has hundreds of thousands of books and a lot of users are going to be searching it.
Here are the steps for a solution:
1) Read the search string from the user.
2) Split the string into parts on the spaces (" ") between the words.
3) Use the following query to get the result:
SELECT * FROM books WHERE Title LIKE '%part[0]%' AND Edition LIKE '%part[1]%';
Here part[0] and part[1] are the separate words split out of the search string.
The PHP code for the above could be:
<?php
$string_array = explode(" ", $string); // $string is the value we are searching for
$title = mysql_real_escape_string($string_array[0]);
$edition = mysql_real_escape_string($string_array[1]);
$select_query = "SELECT * FROM books WHERE Title LIKE '%".$title."%' AND Edition LIKE '%".$edition."%'";
$result = mysql_query($select_query);
while ($row = mysql_fetch_assoc($result)) {
    // each $row is one matching book
}
?>
For $string_array[0], this could be extended so that all the parts except the last one are matched against the title, which covers cases like "Introduction to Microeconomics 4".
For your application, where you're interested in just title and edition, I suspect that using a FULLTEXT index with MATCH/AGAINST and reducing ft_min_word_len to 1 would not have that much impact performance-wise (if your data were more verbose or user-written content, then I might hesitate).
The easiest way to check is to change the value, REPAIR the table to account for the new ft_min_word_len and rebuild the index, and do some simple benchmarking.
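The variable can only be changed in the server configuration, and the FULLTEXT index has to be rebuilt afterwards; roughly:

# in the [mysqld] section of my.cnf / my.ini, then restart the server
ft_min_word_len = 1

-- then rebuild the FULLTEXT index so it picks up the new minimum word length
REPAIR TABLE Books QUICK;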
Having said that, for your application, I might consider looking into Sphinx. It's definitely going to be orders of magnitude faster, and your content is relatively static, so a delay between re-indexing runs (Sphinx's main drawback IMO) isn't an issue. Plus, with careful usage of the wordforms and exceptions, you could map things like 4/four/fourth/IV all to the same token for improved searching.
We are porting an app which formerly used Openbase 7 to now use MySQL 5.0.
OB 7 had quite badly defined (i.e. undocumented) behavior regarding case sensitivity. We only found this out now, when trying the same queries with MySQL.
It appears that OB 7 treats lookups using "=" differently from those using "LIKE": if you have two values "a" and "A", and make a query with WHERE f="a", then it finds only the "a" row, not the "A" row. However, if you use LIKE instead of "=", then it finds both.
Our tests with MySQL showed that if we're using a non-binary collation (e.g. latin1), then both "=" and "LIKE" compare case-insensitively. However, to simulate OB's behavior, we need to get only "=" to be case-sensitive.
We're now trying to figure out how to deal with this in MySQL without having to add a lot of LOWER() function calls to all our queries (there are a lot!).
We have full control over the MySQL DB, meaning we can choose its collation mode as we like (our table names and unique indexes are not affected by the case sensitivity issues, fortunately).
Any suggestions how to simulate the OpenBase behaviour on MySQL with the least amount of code changes?
(I realize that a few smart regex replacements in our source code to add the LOWER calls might do the trick, but we'd rather find a different way)
Another idea: does MySQL offer something like user-defined functions? You could then write a version of LIKE that is case-insensitive (ci_like or so) and change all LIKEs to ci_like. Probably easier than regexing a LOWER() call into every query.
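MySQL 5.0 supports stored functions (as well as true UDFs written in C), so a minimal sketch of the idea might look like the following, assuming the columns themselves use a case-sensitive (_bin or _cs) collation so that "=" behaves the way OpenBase did; ci_like is just a made-up name:

DELIMITER //
CREATE FUNCTION ci_like(haystack TEXT, pattern TEXT)
RETURNS TINYINT(1) DETERMINISTIC
BEGIN
    -- compare case-insensitively by lower-casing both sides
    RETURN LOWER(haystack) LIKE LOWER(pattern);
END //
DELIMITER ;

-- WHERE f LIKE 'a%'  would become  WHERE ci_like(f, 'a%')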
These two articles talk about case sensitivity in mysql:
Case Sensitive mysql
MySQL docs "Case Sensitivity"
Both were early hits in this Google search:
case sensitive mysql
I know that this is not the answer you are looking for .. but given that you want to keep this behaviour, shouldn't you explicitly code it (rather than changing some magic 'config' somewhere)?
It's probably quite some work, but at least you'd know which areas of your code are affected.
A quick look at the MySQL docs seems to indicate that this is exactly how MySQL does it:
This means that if you search with col_name LIKE 'a%', you get all column values that start with A or a.
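A quick demonstration of that behaviour, plus the BINARY operator as one way to get a case-sensitive "=" without changing the collation (the table and column names are made up):

CREATE TABLE t (f VARCHAR(10)) CHARACTER SET latin1;   -- default collation latin1_swedish_ci
INSERT INTO t VALUES ('a'), ('A');

SELECT * FROM t WHERE f LIKE 'a%';     -- returns both 'a' and 'A'
SELECT * FROM t WHERE f = 'a';         -- also returns both under the _ci collation
SELECT * FROM t WHERE f = BINARY 'a';  -- returns only 'a' (byte-wise comparison)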