I've been asked to put together a search for one of our databases.
The criteria is the user types into a search box, SQL then needs to split up all the words in the search and search for each of them across multiple fields (Probably 2 or 3), it then needs to weight the results for example the result where all the words appear will be the top result and if only 1 word appears it will be weighted lower.
For example if you search for "This is a demo post"
The results would be ranked like this
Rank Field1 Field2
1: "This is a demo post" ""
2: "demo post" ""
3: "demo" "post"
4: "post" ""
Hope that makes some sort of sense, its kind of a base Google like search.
Anyway I can think of doing this is very messy.
Any suggestions would be great.
"Google-like search" means: fulltext search. Check it out!
Understanding fulltext indexing on SQL Server
Understanding SQL Server full-text indexing
Getting started with SQL Server 2005 fulltext searching
SQL Server fulltext search: language features
With SQL Server 2008, it's totally integrated into the SQL Server engine.
Before that, it was a bit of a quirky add-on. Another good reason to upgrade to SQL Server 2008! (and the SP1 is out already, too!)
Marc
Logically you can do this reasonably easily, although it may get hard to optimise - especially if someone uses a particularly long phrase.
Here's a basic example based on a table I have to hand...
SELECT TOP 100 Score, Forename FROM
(
SELECT
CASE
WHEN Forename LIKE '%Kerry James%' THEN 100
WHEN Forename LIKE '%Kerry%' AND Forename LIKE '%James%' THEN 75
WHEN Forename LIKE '%Kerry%' THEN 50
WHEN Forename LIKE '%James%' THEN 50
END AS Score,
Forename
FROM
tblPerson
) [Query]
WHERE
Score > 0
ORDER BY
Score DESC
In this example, I'm saying that an exact match is worth 100, a match with both terms (but not together) is worth 75 and a match of a single word is worth 50. You can make this as complicated as you wish and even include SOUNDEX matches too - but this is a simple example to point you in the right direction.
I ended up creating a full text index on the table and joining my search results to FREETEXTTABLE, allowing me to see the ranked value of each result
The SQL ended up looking something like this
SELECT
Msgs.RecordId,
Msgs.Title,
Msgs.Body
FROM
[Messages] AS Msgs
INNER JOIN FREETEXTTABLE([Messages],Title,#SearchText) AS TitleRanks ON Msgs.RecordId = TitleRanks.[Key]
ORDER BY
TitleRanks.[Key] DESC
I've used full text indexes in the past but never realised you could use FullTextTable like that, was very impressed with how easy it was to code and how well it works.
Related
Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope however, if a user mispells dinosaurs? IE:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Any way programatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzt Logic answer provided by #Neil Knight (+1 to that, for me!).
This stackoverflow article also details possible sources for implentations for Fuzzy Logic in TSQL. Once respondant also outlined Full text Indexing as a potential that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer, there is nothing built in to most SQL engines that can do dictionary-based correction of "fat fingers". SoundEx does work as a tool to find words that would sound alike and thus correct for phonetic misspellings, but if the user typed in "Dinosars" missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SoundEx would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
You can also try the SubString(), to eliminate the first 3 or so characters . Below is an example of how that can be achieved
SELECT Fname, Lname
FROM Table1 ,Table2
WHERE substr(Table1.Fname, 1,3) || substr(Table1.Lname,1 ,3) = substr(Table2.Fname, 1,3) || substr(Table2.Lname, 1 , 3))
ORDER BY Table1.Fname;
I have a query which slows down immensely when i add an addition where part
which essentially is just a like lookup on a varchar(500) field
where...
and (xxxxx.yyyy like '% blahblah %')
I've been racking my head but pretty much the query slows down terribly when I add this in.
I'm wondering if anyone has suggestions in terms of changing field type, index setup, or index hints or something that might assist.
any help appreciated.
sql 2000 enterprise.
HERE IS SOME ADDITIONAL INFO:
oops. as some background unfortunately I do need (in the case of the like statement) to have the % at the front.
There is business logic behind that which I can't avoid.
I have since created a full text catalogue on the field which is causing me problems
and converted the search to use the contains syntax.
Unfortunately although this has increased performance on occasion it appears to be slow (slower) for new word searchs.
So if i have apple.. apple appears to be faster the subsequent times but not for new searches of orange (for example).
So i don't think i can go with that (unless you can suggest some tinkering to make that more consistent).
Additional info:
the table contains only around 60k records
the field i'm trying to filter is a varchar(500)
sql 2000 on windows server 2003
The query i'm using is definitely convoluted
Sorry i've had to replace proprietary stuff.. but should give you and indication of the query:
SELECT TOP 99 AAAAAAAA.Item_ID, AAAAAAAA.CatID, AAAAAAAA.PID, AAAAAAAA.Description,
AAAAAAAA.Retail, AAAAAAAA.Pack, AAAAAAAA.CatID, AAAAAAAA.Code, BBBBBBBB.blahblah_PictureFile AS PictureFile,
AAAAAAAA.CL1, AAAAAAAA.CL1, AAAAAAAA.CL2, AAAAAAAA.CL3
FROM CCCCCCC INNER JOIN DDDDDDDD ON CCCCCCC.CID = DDDDDDDD.CID
INNER JOIN AAAAAAAA ON DDDDDDDD.CID = AAAAAAAA.CatID LEFT OUTER JOIN BBBBBBBB
ON AAAAAAAA.PID = BBBBBBBB.Product_ID INNER JOIN EEEEEEE ON AAAAAAAA.BID = EEEEEEE.ID
WHERE
(CCCCCCC.TID = 654321) AND (DDDDDDDD.In_Use = 1) AND (AAAAAAAA.Unused = 0)
AND (DDDDDDDD.Expiry > '10-11-2010 09:23:38') AND
(
(AAAAAAAA.Code = 'red pen') OR
(
(my_search_description LIKE '% red %') AND (my_search_description LIKE '% nose %')
AND (DDDDDDDD.CID IN (63,153,165,305,32,33))
)
)
AND (DDDDDDDD.CID IN (20,32,33,63,64,65,153,165,232,277,294,297,300,304,305,313,348,443,445,446,447,454,472,479,481,486,489,498))
ORDER BY AAAAAAAA.f_search_priority DESC, DDDDDDDD.Priority DESC, AAAAAAAA.Description ASC
You can see throwing in the my_search_description filter also includes a dddd.cid filter (business logic).
This is the part which is slowing things down (from a 1.5-2 second load of my pages down to a 6-8 second load (ow ow ow))
It might be my lack of understanding of how to have the full text search catelogue working.
Am very impressed by the answers so if anyone has any tips I'd be most greatful.
If you haven't already, enable full text indexing.
Unfortunately, using the LIKE clause on a query really does slow things down. Full Text Indexing is really the only way that I know of to speed things up (at the cost of storage space, of course).
Here's a link to an overview of Full-Text Search in SQL Server which will show you how to configure things and change your queries to take advantage of the full-text indexes.
More details would certainly help, but...
Full-text indexing can certainly be useful (depending on the more details about the table and your query). Full Text indexing requires a good bit of extra work both in setup and querying, but it's the only way to try to do the sort of search you seek efficiently.
The problem with LIKE that starts with a Wildcard is that SQL server has to do a complete table scan to find matching records - not only does it have to scan every row, but it has to read the contents of the char-based field you are querying.
With or without a full-text index, one thing can possibly help: Can you narrow the range of rows being searched, so at least SQL doesn't need to scan the whole table, but just some subset of it?
The '% blahblah %' is a problem for improving performance. Putting the wildcard at the beginning tells SQL Server that the string can begin with any legal character, so it must scan the entire index. Your best bet if you must have this filter is to focus on your other filters for improvement.
Using LIKE with a wildcard at the beginning of the search pattern forces the server to scan every row. It's unable to use any indexes. Indexes work from left to right, and since there is no constant on the left, no index is used.
From your WHERE clause, it looks like you're trying to find rows where a specific word exists in an entry. If you're searching for a whole word, then full text indexing may be a solution for you.
Full text indexing creates an index entry for each word that's contained in the specified column. You can then quickly find rows that contain a specific word.
As other posters have correctly pointed out, the use of the wildcard character % within the LIKE expression is resulting in a query plan being produced that uses a SCAN operation. A scan operation touches every row in the table or index, dependant on the type of scan operation being performed.
So the question really then becomes, do you actually need to search for the given text string anywhere within the column in question?
If not, great, problem solved but if it is essential to your business logic then you have two routes of optimization.
Really go to town on increasing the overall selectivity of your query by focusing your optimization efforts on the remaining search arguments.
Implement a Full Text Indexing Solution.
I don't think this is a valid answer, but I'd like to throw it out there for some more experienced posters comments...are these equivlent?
where (xxxxx.yyyy like '% blahblah %')
vs
where patindex(%blahbalh%, xxxx.yyyy) > 0
As far as I know, that's equivlent from a database logic standpoint as it's forcing the same scan. Guess it couldn't hurt to try?
suppose someone enter this search (on an form):
Nicole Kidman films
Which SQL i can use to find "the best" results ?
I suppose something like this :
SELECT * FROM myTable WHERE ( Field='%Nicole Kidman Films%' OR Field='%Nicole%' OR Field='%Kidman%' OR Field='%Films%' )
My question is how to get most relevant result ?
Thank you very much!
Full-Text Search:
SELECT * FROM myTable WHERE MATCH(Field) AGAINST('Nicole Kidman Films')
This query will return rows in order of relevancy, as defined by the full-text search algorithm.
Note: This is for MySQL, other DBMS have similar functionality.
What you're looking for is often called a "full text search" or a "natural language search". Unfortunately it's not standard SQL. Here's a tutorial on how to do it in mysql: http://devzone.zend.com/article/1304
You should be able to find examples for other database engines.
In SQL, the equals sign doesn't support wildcards in it - your query should really be:
SELECT *
FROM myTable
WHERE Field LIKE '%Nicole Kidman Films%'
OR Field LIKE '%Nicole%'
OR Field LIKE '%Kidman%'
OR Field LIKE '%Films%'
But wildcarding the left side won't use an index, if one exists.
A better approach is to use Full Text Searching, which most databases provide natively but there are 3rd party vendors like Sphinx. Each has it's own algorithm to assign a rank/score based on the criteria searched on in order to display what the algorithm deems most relevant.
I am not a SQL Expert. I’m trying to elegantly solve a query problem that others have had to have had. Surprisingly, Google is not returning anything that is helping. Basically, my application has a “search” box. This search field will allow a user to search for customers in the system. I have a table called “Customer” in my SQL Server 2008 database. This table is defined as follows:
Customer
UserName (nvarchar)
FirstName (nvarchar)
LastName (nvarchar)
As you can imagine, my users will enter queries of varying cases and probably mis-spell the customer’s names regularly. How do I query my customer table and return the 25 results that are closest to their query? I have no idea how to do this ranking and consider the three fields listed in my table.
Thank you!
I would suggest full-text search. Full-text search will provide plenty of options for dealing with some name variants and can rank the "closeness" of the results using CONTAINSTABLE. If you find that full-text search is not sufficient, you might consider a third-party indexing tool like Lucene.
You might want to try using SOUNDEX or DIFFERENCE as an alternative to full text search.
SELECT TOP 25 UserName, FirstName, LastName
FROM Customer
WHERE DIFFERENCE( UserName, #SearchValue ) > 2
ORDER BY DIFFERENCE( UserName, #SearchValue ), UserName
The case issue you can solve easy by setting your table collation to be case insensitive
The misspelling not sure how to handle but have a look at the full text search capabilities of sql server..
I am trying to implement a feature similar to the "Related Questions" on Stackoverflow.
How do I go about writing the SQL statement that will search the Title and Summary field of my database for similar questions?
If my questions is: "What is the SQL used to do a search similar to "Related Questions" on Stackoverflow".
Steps that I can think of are;
Strip the quotation marks
Split the sentence into an array of words and run a SQL search on each word.
If I do it this way, I am guessing that I wouldn't get any meaningful results. I am not sure if Full Text Search is enabled on the server, so I am not using that. Will there be an advantage of using Full Text Search?
I found a similar question but there was no answer: similar question
Using SQL 2005
Check out this podcast.
One of our major performance
optimizations for the “related
questions” query is removing the top
10,000 most common English dictionary
words (as determined by Google search)
before submitting the query to the SQL
Server 2008 full text engine. It’s
shocking how little is left of most
posts once you remove the top 10k
English dictionary words. This helps
limit and narrow the returned results,
which makes the query dramatically
faster.
They probably relate based on tags that are added to the questions...
After enabling Full Text search on my SQL 2005 server, I am using the following stored procedure to search for text.
ALTER PROCEDURE [dbo].[GetSimilarIssues]
(
#InputSearch varchar(255)
)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
DECLARE #SearchText varchar(500);
SELECT #SearchText = '"' + #InputSearch + '*"'
SELECT PostId, Summary, [Description],
Created
FROM Issue
WHERE FREETEXT (Summary, #SearchText);
END
I'm pretty sure it would be most efficient to implement the feature based on the tags associated with each post.
It's probably done using a full text search which matches like words/phrases. I've used it in MySQL and SQL Server with decent success with out of the box functionality.
You can find more on MySQL full text searches at:
http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
Or just google Full Text search and you will find a lot of information.
It looks keyword based on the title you enter, queried against titles and content of other questions. Probably easier (and more appropriate) to do in Lucene (or similar) then in a relational database.
I'd say it's probably a full text search on the question title and the question content and answers as well using the individual words (not the whole title) you enter. Then, using the ranking features of full-text, the top 10 or so questions that rank the highest are displayed.
As tydok pointed out, it looks like they are using full-text searching (I couldn't imagine any other way).
Here's the MSDN reference on Full-Text Searching, nailing the specific query used probably isn't going to happen.
The SQL very well may be just "SELECT * FROM questions;". I find it hard to imagine that the algorithm for finding similar questions is implemented in SQL.