I'm using MS SQL Server 2008 R2 with Full Text Search for searching text data stored in different languages.
I'm a bit confused about how CONTAINS predicate works with accents.
When I use the following predicate
CONTAINS([Text], @keywords, LANGUAGE @language)
on a catalog with ACCENT_SENSITIVITY = OFF, the search results are the same for e.g. 'Lächeln' and 'lacheln' when German is specified as the language.
But if I change the predicate to look like
CONTAINS([Text], FORMSOF(INFLECTIONAL, @keywords), LANGUAGE @language)
then the results are different, and it seems to me that accent insensitivity doesn't work with FORMSOF.
I've tried to find an answer on MSDN and Google but didn't find anything useful.
Does anybody know why the results are different?
Thanks!
My understanding is that these serve two separate purposes in finding matches for a full-text search. With an accent insensitive catalog there is a simple character equality performed for the term matching so that eñya = enya because 'n' is considered the accent insensitive equivalent of 'ñ'.
With FORMSOF you're requesting that the search perform a stemming operation on the terms so that verb and noun forms will be searched as additional terms in the search. e.g. searching for 'foot' would include 'feet' and 'run' would include 'ran'.
If the FORMSOF seems to be fundamentally not working for your values you may want to make sure that you have the appropriate language support installed for full-text languages.
SELECT * FROM sys.fulltext_languages
If you haven't had a chance to review MSDN the SQL Word Breakers documentation may shed some light on the observed behavior. http://msdn.microsoft.com/en-us/library/ms142509.aspx
FORMSOF strips the diacritics from your word. You can see this with:
SELECT * FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "Lächeln")', 1031, 0, 1)
Check the "display_term" column.
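To see how accent sensitivity and FORMSOF interact, it may help to run the parser with each setting and compare the rows side by side. This is only a sketch: it assumes German (LCID 1031) is installed, that the caller has the permissions sys.dm_fts_parser requires, and it makes no claim about what the output will be on your server.

```sql
-- Arguments: query string, LCID (1031 = German),
-- stoplist id (0 = system stoplist), accent sensitivity (0 = off, 1 = on).

-- Plain term, accent-insensitive parsing
SELECT display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"Lächeln"', 1031, 0, 0);

-- Plain term, accent-sensitive parsing
SELECT display_term, special_term, source_term
FROM sys.dm_fts_parser(N'"Lächeln"', 1031, 0, 1);

-- Inflectional expansion of the accented and unaccented spellings;
-- compare the generated display_term values of the two result sets.
SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "Lächeln")', 1031, 0, 0);

SELECT display_term, expansion_type, source_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, "lacheln")', 1031, 0, 0);
```

If the last two queries produce different sets of inflected forms, that would explain why the FORMSOF results differ even though the plain-term results match.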
I've been searching on the internet and found that Full-Text Search usually has better performance.
I followed the instructions on this post to set up thesaurus tables on my machine so I can play around with it and get more familiar with full-text search.
I am viewing everything in Microsoft SQL Server Management Studio 2008.
When I ran the queries, I noticed that my LIKE search was faster than my FREETEXT search, which contradicts what I found on most wiki sites/pages.
Below are the queries I ran:
select *
from TheThesaurus
where freetext(TheDefinition, 'aspire')
select *
from TheThesaurus
where TheDefinition like '%aspire%'
The LIKE search took 0 sec, whereas the FREETEXT search took 6 sec.
The LIKE search returns 70 rows, whereas the FREETEXT search returns 94, so FREETEXT finds more matches and gives a better result.
Is there something I'm missing that causes the FREETEXT search to be much slower than the LIKE search?
I would really like to use FREETEXT search in my program because it returns more hits (collects more data), but the speed is a significant issue.
Thanks for the help!
Have you created a full-text index? If not, see CREATE FULLTEXT CATALOG on the MSDN site, or this link, which walks you through it using SQL Server 2008: http://www.codeproject.com/Articles/29237/SQL-SERVER-2008-Creating-Full-Text-Catalog-and-Ful
Another reason the run times would be different has to do with what the predicates are doing. The LIKE is closer to an exact match. The FREETEXT function "searches for the values that match the meaning of a phrase and not just exact words" so your FREETEXT command is doing more work. That is from "Querying Microsoft SQL Server 2012"
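As a rough illustration of the full-text index check mentioned above, something like the following could be used. The schema name dbo, the catalog name ThesaurusCatalog and the key index name PK_TheThesaurus are assumptions made for the sketch; substitute the unique, single-column, non-nullable index that actually exists on your table.

```sql
-- Does the table already have a full-text index, and is it enabled?
SELECT OBJECT_NAME(object_id) AS table_name, is_enabled
FROM sys.fulltext_indexes
WHERE object_id = OBJECT_ID(N'dbo.TheThesaurus');

-- If no row comes back, create a catalog and an index.
-- ThesaurusCatalog and PK_TheThesaurus are hypothetical names.
CREATE FULLTEXT CATALOG ThesaurusCatalog;

CREATE FULLTEXT INDEX ON dbo.TheThesaurus (TheDefinition)
    KEY INDEX PK_TheThesaurus
    ON ThesaurusCatalog;
```

The index is populated in the background after creation, so timings taken immediately afterwards may not be representative.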
I've got a stored procedure that, in the general case, performs its search using full-text indexes. But I can't build a full-text index for one field, so I need to use a LIKE construction there.
So, the problem is: the parameter could be
"a*" or "b*"
as it would be passed to the CONTAINS command.
Can anyone suggest a good way to transform this parameter for use with LIKE?
Thank you.
P.S: I use MSSQL Server
Depending on the full-text search constructs you want to support, this is generally impossible.
According to MSDN, full-text search syntax on SQL Server supports these constructs:
One or more specific words or phrases (simple term)
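something along the lines of LIKE '%[- ,;.()!?]Term[- ,;.()!?]%' (the hyphen goes first inside the brackets so that it is read as a literal character rather than as a range)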
A word or a phrase where the words begin with specified text (prefix term)
something along the lines of LIKE '%[- ,;.()!?]Term%'
Inflectional forms of a specific word (generation term)
Not possible
A word or phrase close to another word or phrase (proximity term)
Not possible
Synonymous forms of a specific word (thesaurus)
Not possible
Words or phrases using weighted values (weighted term)
Not possible
Those which I have marked "not possible" can't really be translated to LIKE queries, but of course you could get inventive (using your own stemming algorithm for inflectional forms, or your own thesaurus for synonyms) to support at least some of those.
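The two LIKE approximations above could be sketched as follows. The table dbo.SomeTable, the column SomeColumn and the variable @term are hypothetical names used only for illustration. The column is padded with a space so that a term at the very start or end of the text still has a delimiter on each side, and the hyphen is placed first in the bracket set so it is treated literally.

```sql
DECLARE @term VARCHAR(50) = 'Term';  -- hypothetical search term

-- Simple term: the whole word, delimited on both sides
SELECT *
FROM dbo.SomeTable
WHERE ' ' + SomeColumn + ' ' LIKE '%[- ,;.()!?]' + @term + '[- ,;.()!?]%';

-- Prefix term: any word that begins with the term
SELECT *
FROM dbo.SomeTable
WHERE ' ' + SomeColumn LIKE '%[- ,;.()!?]' + @term + '%';
```

Neither predicate can use an ordinary index because of the leading wildcard, and a term containing %, _ or [ would need escaping before being concatenated into the pattern.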
In the end, you will probably need to use dynamic SQL.
Here is a way you can get the correct WHERE clause, given that input:
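declare @str varchar(255) = '"a*" or "b*"';
-- "const" holds the column name so it is specified in only one place
with const as (select 'col' as col)
-- turn " into ', * into %, and repeat "col like" after every "or"
select col + ' like ' + replace(replace(replace(@str, '"', ''''), '*', '%'), 'or ', 'or ' + col + ' like ') as WhereClause
from const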
The "const" is just a table with one column to specify your column name. It allows it to be specified in one place.
This just does replaces to get the correct syntax for LIKE. Of course, this would be more complex to support more functionality from CONTAINS.
Thanks to everyone!
Unfortunately, expression parsing is not enough for the general case.
I use regular expressions in MS SQL SERVER
http://anastasiosyal.com/POST/2008/07/05/REGULAR-EXPRESSIONS-IN-MS-SQL-SERVER-USING-CLR.ASPX
Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and get back related records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope, however, if a user misspells dinosaurs? For example:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Is there any way to do this programmatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526).
You can also use the DIFFERENCE function (on the same link as SOUNDEX), which compares levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
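For the multi-word case from the question, note that SOUNDEX produces a single four-character code, so applying it to a whole article body only reflects the beginning of the text. One possible workaround, sketched here under the assumption that you maintain a table with one row per word per article (dbo.ArticleWords is a hypothetical name, not something from the question), is to compare each search term against individual words:

```sql
-- Hypothetical schema: dbo.ArticleWords(articleId INT, word VARCHAR(100)),
-- holding one row for every distinct word of every article.
DECLARE @terms TABLE (term VARCHAR(100));
INSERT INTO @terms (term) VALUES ('Dinosores'), ('wrocks');

-- Articles containing at least one word that sounds like any search term
SELECT DISTINCT w.articleId
FROM dbo.ArticleWords AS w
JOIN @terms AS t
  ON DIFFERENCE(w.word, t.term) = 4;  -- 4 = strongest phonetic match
```

Lowering the threshold to 3 would return looser matches, at the cost of more false positives.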
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzy Logic answer provided by @Neil Knight (+1 to that, for me!).
This Stack Overflow article also details possible sources for implementations of Fuzzy Logic in T-SQL. One respondent also suggested Full-Text Indexing as an option that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
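As a rough sketch of the "fewest edits" comparison described above, a plain Levenshtein distance can be written as a T-SQL scalar function. This is an illustration rather than a tuned implementation: the function name dbo.EditDistance is made up, it counts substitutions, insertions and deletions only (not transpositions), characters are compared under the database's collation, and the table-variable approach is slow for anything but short words.

```sql
CREATE FUNCTION dbo.EditDistance (@s NVARCHAR(100), @t NVARCHAR(100))
RETURNS INT
AS
BEGIN
    DECLARE @m INT, @n INT, @i INT, @j INT, @cost INT;
    DECLARE @above INT, @left INT, @diag INT, @val INT;
    -- Previous and current rows of the dynamic-programming matrix
    DECLARE @prev TABLE (j INT PRIMARY KEY, v INT NOT NULL);
    DECLARE @cur  TABLE (j INT PRIMARY KEY, v INT NOT NULL);

    SET @m = LEN(@s);
    SET @n = LEN(@t);

    -- Row 0: turning an empty prefix of @s into the first j characters of @t
    SET @j = 0;
    WHILE @j <= @n
    BEGIN
        INSERT INTO @prev (j, v) VALUES (@j, @j);
        SET @j = @j + 1;
    END;

    SET @i = 1;
    WHILE @i <= @m
    BEGIN
        DELETE FROM @cur;
        INSERT INTO @cur (j, v) VALUES (0, @i);  -- delete i characters
        SET @j = 1;
        WHILE @j <= @n
        BEGIN
            SET @cost = CASE WHEN SUBSTRING(@s, @i, 1) = SUBSTRING(@t, @j, 1)
                             THEN 0 ELSE 1 END;
            SELECT @above = v + 1     FROM @prev WHERE j = @j;      -- deletion
            SELECT @diag  = v + @cost FROM @prev WHERE j = @j - 1;  -- substitution
            SELECT @left  = v + 1     FROM @cur  WHERE j = @j - 1;  -- insertion
            SET @val = CASE WHEN @above <= @left AND @above <= @diag THEN @above
                            WHEN @left <= @diag THEN @left
                            ELSE @diag END;
            INSERT INTO @cur (j, v) VALUES (@j, @val);
            SET @j = @j + 1;
        END;
        -- The current row becomes the previous row for the next character
        DELETE FROM @prev;
        INSERT INTO @prev (j, v) SELECT j, v FROM @cur;
        SET @i = @i + 1;
    END;

    RETURN (SELECT v FROM @prev WHERE j = @n);
END;
```

With a hypothetical dictionary table of known good terms, the closest suggestion for a misspelled keyword would then be the row with the smallest dbo.EditDistance(word, @keyword). For example, 'Dinosars' and 'Dinosayrs' are each one edit away from 'Dinosaurs'.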
You can also try SUBSTR() to compare only the first three or so characters of each name. Below is an example of how that can be achieved (this uses Oracle-style syntax; on SQL Server use SUBSTRING() and + in place of SUBSTR() and ||):
SELECT Table1.Fname, Table1.Lname
FROM Table1, Table2
WHERE substr(Table1.Fname, 1, 3) || substr(Table1.Lname, 1, 3) = substr(Table2.Fname, 1, 3) || substr(Table2.Lname, 1, 3)
ORDER BY Table1.Fname;
I'm using SQL Full-Text Search and have a stored procedure that uses the FREETEXTTABLE function.
This all works great, however, I have noticed that if I search for something such as 'Chapter 19' the 19 seems as if it is thrown away and the search only searches on 'Chapter'.
Also if I search for just '19' I get no results. I know the columns I have indexed contain a '19' in multiple rows.
Is this the intended behaviour? To not index numerics?
If so, then I suppose I'll have to live with it, but if not I'll be happy to post any T-SQL if anyone thinks I'm doing anything wrong.
Thanks.
P.S. I've googled this and have found nothing on searching numerics with full-text search.
I eventually found the reason behind this.
Numerics are considered as noise words in SQL server. You can allow searching on numerics by removing the numeric entries in the appropriate noise file for your language.
Noise files are found in the FTData directory of your SQL Server install.
The english noise files are: noiseENU.txt & noiseENG.txt
Hope this helps someone.
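On SQL Server 2008 and later, noise-word files were replaced by stoplists, so the equivalent step is done in T-SQL rather than by editing a text file. The following is a sketch under assumptions: the stoplist name NumericFriendlyStoplist and the table dbo.Documents are hypothetical, 1033 is the LCID for US English, and the database must be at compatibility level 100 or higher.

```sql
-- Which numeric entries does the system stoplist hold for this language?
SELECT stopword
FROM sys.fulltext_system_stopwords
WHERE language_id = 1033
  AND stopword LIKE '[0-9]%';

-- Copy the system stoplist, then drop the numeric stopwords you want searchable
CREATE FULLTEXT STOPLIST NumericFriendlyStoplist FROM SYSTEM STOPLIST;
ALTER FULLTEXT STOPLIST NumericFriendlyStoplist DROP '1' LANGUAGE 1033;
-- ...repeat the DROP for each numeric stopword returned by the query above

-- Attach the new stoplist to the full-text index (this triggers a repopulation)
ALTER FULLTEXT INDEX ON dbo.Documents SET STOPLIST NumericFriendlyStoplist;
```

If no stopword filtering is wanted at all, SET STOPLIST OFF on the index is a blunter alternative.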
I seem to have a weird bug in Microsoft SQL Server 2005 where FREETEXT() searches are somewhat case-sensitive despite the collation being case-insensitive (Latin1_General_CI_AS).
First off, LIKE queries are perfectly case-insensitive, so
WHERE column LIKE '%word%'
and
WHERE column LIKE '%Word%'
return the same results.
Also, FREETEXT queries are in fact case-insensitive to some extent; for instance,
WHERE FREETEXT(column, 'Word')
will return results with different cases.
BUT
WHERE FREETEXT(column, 'word')
while still returning case-insensitive matches for word, gives a different resultset.
Or, as I found out after some investigation, searching for word gives all matches for different cases of word but searching for Word gives the same PLUS inflectional results.
Or to use one of the actual cases I found, searching for marketingleader returns all results containing that word, independent of the case, whereas searching for Marketingleader would return those, but also results that just contain leader that don't show up when searching for the lower case.
Has anyone got any idea what is causing this, and how I could turn on inflectional/fuzzy searching for lower-case words as well?
Any help would be appreciated.
Use CONTAINS, the alternative to FREETEXT, where inflectional results are optional:
CONTAINS (Transact-SQL)
Oops, I just saw that you mention CONTAINS in your question, but does it behave the same way as FREETEXT in the provided examples?
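To make the inflectional behaviour explicit rather than leaving it to FREETEXT, the two variants could be written as follows. The table and column names are placeholders, and whether this removes the case dependence described in the question is something to verify on your own data.

```sql
-- Exact term only (still case-insensitive under a CI collation)
SELECT *
FROM dbo.SomeTable
WHERE CONTAINS(SomeColumn, N'"marketingleader"');

-- Same term with inflectional forms requested explicitly
SELECT *
FROM dbo.SomeTable
WHERE CONTAINS(SomeColumn, N'FORMSOF(INFLECTIONAL, "marketingleader")');
```

Running both with the lower-case and the capitalised spelling would show whether the difference comes from the inflectional expansion or from how the word breaker treats the capitalised term.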