SQL: transform full-text search into like construction - sql

I've got stored procedure that performs search using full-text indexes in general case. But I can't build full-text index for one field, and I need to use LIKE construction.
So, the problem is: parameter could be
"a*" or "b*"
like parameter for CONTAINS command.
Сan anyone give a good solution, how to transform this parameter for LIKE construction.
Thank you.
P.S: I use MSSQL Server

Depending on the full-text search constructs you want to support, this is generally impossible.
According to MSDN, full-text search syntax on SQL Server supports these constructs:
One or more specific words or phrases (simple term)
something along LIKE '%[,;.-()!? ]Term[,;.-()!? ]%'
A word or a phrase where the words begin with specified text (prefix term)
something along LIKE '%[,;.-()!? ]Term%'
Inflectional forms of a specific word (generation term)
Not possible
A word or phrase close to another word or phrase (proximity term)
Not possible
Synonymous forms of a specific word (thesaurus)
Not possible
Words or phrases using weighted values (weighted term)
Not possible
Those which I have marked "not possible" can't really be translated to LIKE queries, but of course you could get inventive (using your own stemming algorithm for inflectional forms, or your own thesaurus for synonyms) to support at least some of those.

In the end, you will probably need to use dynamic SQL.
Here is a way you can get the correct WHERE clause, given that input:
declare #str varchar(255) = '"a*" or "b*"';
with const as (select 'col' as col)
select col+' like '+replace(replace(REPLACE(#str, '"', ''''), '*', '%'), 'or ', 'or '+COL+' like ') as WhereClause
from const
The "const" is just a table with one column to specify your column name. It allows it to be specified in one place.
This just does replaces to get the correct syntax for LIKE. Of course, this would be more complex to support more functionality from CONTAINS.

Thanks to everyone!
Unfortunately expression parsing is not enough for general case.
I use regular expressions in MS SQL SERVER
http://anastasiosyal.com/POST/2008/07/05/REGULAR-EXPRESSIONS-IN-MS-SQL-SERVER-USING-CLR.ASPX

Related

SQL Server full text search and spaces

I have a column with a product names. Some names look like ‘ab-cd’ ‘ab cd’
Is it possible to use full text search to get these names when user types ‘abc’ (without spaces) ? The like operator is working for me, but I’d like to know if it’s possible to use full text search.
If you want to use FTS to find terms that are adjacent to each other, like words separated by a space you should use a proximity term.
You can define a proximity term by using the NEAR keyword or the ~ operator in the search expression, as documented here.
So if you want to find ab followed immediately by cd you could use the expression,
'NEAR((ab,cd), 0)'
searching for the word ab followed by the word cd with 0 terms in-between.
No, unfortunately you cannot make such search via full-text. You can only use LIKE in that case LIKE ('ab%c%')
EDIT1:
You can create a view (WITH SCHEMABINDING!) with some id and column name in which you want to search:
CREATE VIEW dbo.ftview WITH SCHEMABINDING
AS
SELECT id,
REPLACE(columnname,' ','') as search_string
FROM YourTable
Then create index
CREATE UNIQUE CLUSTERED INDEX UCI_ftview ON dbo.ftview (id ASC)
Then create full-text search index on search_string field.
After that you can run CONTAINS query with "abc*" search and it will find what you need.
EDIT2:
But it wont help if search_string does not start with your search term.
For example:
ab c d -> abcd and you search cd
No. Full Text Search is based on WORDS and Phrases. It does not store the original text. In fact, depending on configuration it will not even store all words - there are so called stop words that never go into the index. Example: in english the word "in" is not selective enough to be considered worth storing.
Some names look like ‘ab-cd’ ‘ab cd’
Those likely do not get stored at all. At least the 2nd example is actually 2 extremely short words - quite likely they get totally ignored.
So, no - full text search is not suitable for this.

SQL Server Full Text Search matches part of a word, even without wildcard

Take this query:
SELECT * FROM Books
WHERE CONTAINS(([Description], ReverseDescription), '"øgle"')
And these two text for the columns being search:
http://textuploader.com/5bg5r
http://textuploader.com/5bg59
Why does that one match? I cannot for find an exact match in either of those texts. And as far as I know only the partial match should show up if I use the following query:
SELECT * FROM Books
WHERE CONTAINS(([Description], ReverseDescription), '"øgle*"')
Anyone know what's going on?
Full-Text works on selected language grammar and vocabulary basis, not on simple character comparison like LIKE would do. Each language defines stemmers and word breakers. I can't say weather øgle is a full word by itself and how is your FT index treating that ø. My suspicion is that your index is not created with Danish language rules. If your index is indeed using the correct language, then you need to check the stemmer and breakers rules in use for that language.
Update
Actually I think is simpler. The presence of "" makes the search term a prefix term, event without an *. MSDN is a bit ambiguous here, because for example in Performing Prefix Searches it states:
When the prefix term is a phrase, each token making up the phrase is considered a separate prefix term. All rows that have words beginning with the prefix terms will be returned. For example, the prefix term "light bread*" will find rows with text of either "light breaded," "lightly breaded," or "light bread", but will not return "Lightly toasted bread".
Note how light in the example is a prefix and does not require light*. I do not have a system to test, so is a bit of speculation on my side, but I suspect that CONTAINS will consider "øgle" as a case insensitive prefix search and then your text contains two matches for Øgledronning and Øgledronningens.
Change COLLATE Latin1_General_CS_AS
for example query will look like
SELECT * FROM Books
WHERE CONTAINS(([Description], ReverseDescription), '"øgle*"')
AND [Description] COLLATE Latin1_General_CS_AS LIKE '%"øgle*"%'

How to find strings which are similar to given string in SQL server?

I have a SQL server table which contains several string columns. I need to write an application which gets a string and search for similar strings in SQL server table.
For example, if I give the "مختار" or "مختر" as input string, I should get these from SQL table:
1 - مختاری
2 - شهاب مختاری
3 - شهاب الدین مختاری
I've searched the net for a solution but I have found nothing useful. I've read this question , but this will not help me because:
I am using MS SQL Server not MySQL
my table contents are in Persian, so I can't use Levenshtein distance and similar methods
I prefer an SQL Server only solution, not an indexing or daemon based solution.
The best solution would be a solution which help us sort result by similarity, but, its optional.
Do you have any suggestion for that?
Thanks
MSSQL supports LIKE which seems like it should work. Is there a reason it's not suitable for your program?
SELECT * FROM table WHERE input LIKE '%مختار%'
Hmm.. considering that you read the other post you probably know about the like operator already... maybe your problem is "getting the string and searching for something similar"?
--This part searches for a string you want
declare #MyString varchar(max)
set #MyString = (Select column from table
where **LOGIC TO FIND THE STRING GOES HERE**)
--This part searches for that string
select searchColumn, ABS(Len(searchColumn) - Len(#MyString)) as Similarity
from table where data LIKE '%' + #MyString + '%'
Order by Similarity, searchColumn
The similarity part is something like the thing you posted. If the strings are "more similar" meaning that they have a similar length, they will be higher on the results query.
The absolute part can be avoided obviously but I did it just in case.
Hope that helps =-)
Besides like operator, you can use the condition WHERE instr(columnname, search) > 0; however this is generally slower. What it does is return the starting position of a string within another string. thus if searching in ABCDEFG for CD it would return 3. 3>0, so the record would be returned. However in the case you've described, like seems to be the best solution.
The general problem is that in languages where the same letter has different writing form in the beginning, middle and at the end of word, and thus - different codes - we can try to use specific Persian collations, but in general this will not help.
The second option - is to use SQL FTS abilities, but again - if it has not special language module for the language - it is much less useful.
And most general way - to use your own language processing - which is very complex task at all. The next keywords and google can help to understand the size of the problem: DLP, words and terms, bi-gramms, n-gramms, grammar and morphology inflection
Try to use the Built-in Soundex() And Difference() functions. I hope they work fine for Persian.
Look at the following reference:
http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
Similarity() function helps you to sort result by similarity (as you asked in your question) and it is also possible using algorithms different from Levenshtein edit distance depends on the Value for #method Algorithm:
0 The Levenshtein edit distance algorithm
1 The Jaccard similarity coefficient algorithm
2 A form of the Jaro-Winkler distance algorithm
3 Longest common subsequence algorithm
Like operator may not do what he is asking for. Like for example, if i have a record value "please , i want to ask a question' in my database record. and lets say on my query, i want to find a match similarity like this 'Can i ask a question, please'. like operator may do this using like %[your senttence] or [your sentence]% but it is not advisable to use it for string similarity cos sentences may change and all your like logic may not fetch the matching records. It is advisable to use naive bayes text classification for similarities assigning labels to your sentences or you can try the semantic search function in MSSQL server

SQL - searching database with the LIKE operator

Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope however, if a user mispells dinosaurs? IE:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Any way programatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzt Logic answer provided by #Neil Knight (+1 to that, for me!).
This stackoverflow article also details possible sources for implentations for Fuzzy Logic in TSQL. Once respondant also outlined Full text Indexing as a potential that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
You can also try the SubString(), to eliminate the first 3 or so characters . Below is an example of how that can be achieved
SELECT Fname, Lname
FROM Table1 ,Table2
WHERE substr(Table1.Fname, 1,3) || substr(Table1.Lname,1 ,3) = substr(Table2.Fname, 1,3) || substr(Table2.Lname, 1 , 3))
ORDER BY Table1.Fname;

Sql Server 2005 Fulltext case sensitivity problem

I seem to have a weird bug in Microsoft SQL Server 2005 where FREETEXT() searches are somewhat case-sensitive despite the collation being case-insensitive (Latin1_General_CI_AS).
First of, LIKE queries are perfectly case-insensitive, so
WHERE column LIKE '%word%'
and
WHERE column LIKE '%Word%'
return the same results.
Also, FREETEXT are infact case-insensitive to some extent, for instance
WHERE FREETEXT(column, 'Word')
will return results with different cases.
BUT
WHERE FREETEXT(column, 'word')
while still returning case-insensitive matches for word, gives a different resultset.
Or, as I found out after some investigation, searching for word gives all matches for different cases of word but searching for Word gives the same PLUS inflectional results.
Or to use one of the actual cases I found, searching for marketingleader returns all results containing that word, independent of the case, whereas searching for Marketingleader would return those, but also results that just contain leader that don't show up when searching for the lower case.
has anyone got any Idea as to what is causing this and how I could turn on inflectional/fuzzy searching for lower-case words as well?
Any help would be appreciated.
Use the alternative to freetext which is contains and the inflectional results are optional ..
CONTAINS (Transact-SQL)
.. oups just saw that you mention contains in your question, but does it behave the same way as the freetext in the provided examples ?