I have below values in my database.
been Lorem Ipsum and scrambled ever
scrambledtexttextofandtooktooktypetexthastheunknownspecimenstandardsincetypesett
Here is my query:
SELECT
nBusinessAdID,
MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE) AS score
FROM wiki_businessads
WHERE MATCH (`sHeadline`) AGAINST ("text" IN BOOLEAN MODE)
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
It's not fetching first result, why? It should fetch first result because its contain text word in it. I have disabled the stopword filtering.
This one is also not working
SELECT
nBusinessAdID,
MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE) AS score
FROM wiki_businessads
WHERE MATCH (`sHeadline`) AGAINST ('"text"' IN BOOLEAN MODE)
AND bDeleted ="0" AND nAdStatus ="1"
ORDER BY score DESC, bPrimeListing DESC, dDateCreated DESC
The full text search only matches words and word prefixes. Because your data in the database does not contain word boundaries (spaces) the words are not indexed, so they are not found.
Some possible choices you could make are:
Fix your data so that it contains spaces between words.
Use LIKE '%text%' instead of a full text search.
Use an external full-text search engine.
I will expand on each of these in turn.
Fix your data so that it contains spaces between words.
Your data seems to have been corrupted somehow. It looks like words or sentences but with all the spaces removed. Do you know how that happened? Was it intentional? Perhaps there is a bug elsewhere in the system. Try to fix that. Find out where the data came from and see if it can be reimported correctly.
If the original source doesn't contain spaces, perhaps you could use some natural language toolkit to guess where the spaces should be and insert them. There most likely already exist libraries that can do this, although I don't happen to know any. A Google search might find something.
Use LIKE '%text%' instead of a full text search.
A workaround is to use LIKE '%text%' instead but note that this will be much slower as it will not be able to use the index. However it will give the correct result.
Use an external full-text search engine.
You could also look at Lucene or Sphinx. For example I know that Sphinx supports finding text using *text*. Here is an extract from the documentation which explains how to enable infix searching, which is what you need.
9.2.16. min_infix_len
Minimum infix prefix length to index. Optional, default is 0 (do not index infixes).
Infix indexing allows to implement wildcard searching by 'start*', '*end', and 'middle' wildcards (refer to enable_star option for details on wildcard syntax). When mininum infix length is set to a positive number, indexer will index all the possible keyword infixes (ie. substrings) in addition to the keywords themselves. Too short infixes (below the minimum allowed length) will not be indexed.
For instance, indexing a keyword "test" with min_infix_len=2 will result in indexing "te", "es", "st", "tes", "est" infixes along with the word itself. Searches against such index for "es" will match documents that contain "test" word, even if they do not contain "es" on itself. However, indexing infixes will make the index grow significantly (because of many more indexed keywords), and will degrade both indexing and searching times.
Related
In the Microsoft SQL Server, our searches are limited to starting words when we use a full-text search to search for values. That is, we cannot search contains the word looks like the LIKE operator in the middle.
I try to execute this query but the result is not my opinion.
I want to search for the middle of the term. For example, if my term is "Microsoft" and my query is :
SELECT *
FROM dbo.SMS_Outbox
WHERE CONTAINS(MessageText, N'"*soft*"')
There is no result returned!
The documentation is quite clear that wildcards are allowed only at the end of search terms:
The CONTAINS predicate supports the use of the asterisk (*) as a wildcard character to represent words and phrases. You can add the asterisk only at the end of the word or phrase. The presence of the asterisk enables the prefix-matching mode. In this mode, matches are returned if the column contains the specified search word followed by zero or more other characters.
You cannot do what you want easily. One simple option is to switch to LIKE and take the performance hit:
WHERE MessageText LIKE N'%soft%'
Another option might be to parse your text in such a way that soft is always at the beginning of a search term.
I are trying to search an FTI using CONTAINS for Twitter-style usernames, e.g. #username, but word breakers will ignore the # symbol. Is there any way to disable word breakers? From research, there is a way to create a custom word breaker DLL and install it and assign it but that all seems a bit intensive and, frankly, over my head. I disabled stop words so that dashes are not ignored but I need that # symbol. Any ideas?
You're not going to like this answer. But full text indexes only consider the characters _ and ` while indexing. All the other characters are ignored and the words get split where these characters occur. This is mainly because full text indexes are designed to index large documents and there only proper words are considered to make it a more refined search.
We faced a similar problem. To solve this we actually had a translation table, where characters like #,-, / were replaced with special sequences like '`at`','`dash`','`slash`' etc. While searching in the full text, u've to again replace ur characters in the search string with these special sequences and search. This should take care of the special characters.
I am trying to index in Lucene a field that could have RDF literal in different languages.
Most of the approaches I have seen so far are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
Lucene 2.9+ has a feature called Payload that allows to attach attributes to term. Is anyone use this mechanism to store language (or other attributes such as datatypes) information ? How is performance compared to the two other approaches ? Any pointer on source code showing how it is done would help. Thanks.
It depends.
Do you want to allow something like: "Search all english text for 'foo'"? If so, then you will need one field per language.
Or do you want "Search all text for 'foo' and present the user with which language the match was found in?" If this is what you want, then either payloads or separate fields will work.
An alternative way to do it is to index all your text in one field, then have another field saying the language of the document. (Assuming each document is in a single language.) Then your search would be something like +text:foo +language:english.
In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).
so basically lucene is a ranking algorithm, it just looks at strings and compares them to other string. they can be encoded in different character encodings but their similarity is the same non the less. Just make sure you load the SnowBallAnalyzer with the supported langugage stemmer and you should get results. Like say Spanish or Chinese
I'm trying to get my full text search (in boolean mode) to retrieve words with three letters or less.
Currently, if I search for something like "NBA", I don't get any results.
However, if I append the wild card operator "*" to the search term, I get results.
I also read that you could remove the three word limit in my.ini, but I'm wondering if there was a better way to do this on the fly.
This section of the manual might interest you : 11.8.6. Fine-Tuning MySQL Full-Text Search (quoting a portion of it) :
The minimum and maximum lengths of
words to be indexed are defined by the
ft_min_word_len and ft_max_word_len
system variables. The
default minimum value is four
characters; the default maximum is
version dependent. If you change
either value, you must rebuild your
FULLTEXT indexes. For example, if you
want three-character words to be
searchable, you can set the
ft_min_word_len variable by putting
the following lines in an option file:
[mysqld]
ft_min_word_len=3
Then you must restart the server and
rebuild your FULLTEXT indexes.
(You should read that page, for more informations I didn't copy-paste ;-) )
MySQL can take care of fulltext search quite well,
but it doesn't highlight the keywords that are searched,
how can I do this most efficiently?
The solutions posed above require retrieval of the entire document in order to search, replace, and highlight text. If the document is large, and many are, this seems like a really bad idea. Better would be for MySQL FTS to return the text offsets directly like SQLITE does, then use an indexed substring operator - that would be significantly more efficient.
Do your sql query and then do a preg_replace on the result, replacing each keyword with KeyWord
$hilitedText = preg_replace ( '/keyword/' , '/<span class="hilite">keyword<\/span>/' , $row['columName']);
And define the hilite class in your css to format however you want the hilighted keywords to appear.
If you have multiple keywords put them in an array and their replacements in a second array in the same order and pass those arrays as the firt two arguments of the function.
Get the result set from mysql. Do a search and replace for each search word, replacing each word with whatever you're doing for highlighting, e.g., <span class='highlight'>word</span>