Searching efficiently with keywords - sql

I'm working with a big table (millions of rows) in a PostgreSQL database. Each row has a name column, and I would like to search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDb search engine.
I've looked at full text search, but I believe it is intended more for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?

My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
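As a rough sketch of the prefix matching described above, assuming a hypothetical table movies(id, name) (only the name column comes from the question). The 'simple' configuration is an assumption made here so that short titles are not stemmed or stripped of stop words; a language-specific configuration may suit other data better.

```sql
-- Hypothetical table; only the name column is taken from the question.
CREATE TABLE movies (
    id   bigserial PRIMARY KEY,
    name text NOT NULL
);

-- Expression index so the tsvector is not recomputed for every row at query time.
CREATE INDEX movies_name_fts_idx
    ON movies USING gin (to_tsvector('simple', name));

-- 'dj:*' matches any word that starts with "dj"; 'uncha:*' works the same way.
SELECT id, name
FROM movies
WHERE to_tsvector('simple', name) @@ to_tsquery('simple', 'dj:*');
```

The WHERE clause has to repeat the indexed expression exactly (same configuration, same column) for the planner to consider the index.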
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
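For what it's worth, Postgres ships a trigram (3-gram) module, pg_trgm, as a contrib extension. The following is a minimal sketch rather than a tuned setup; the table movies(id, name) and the index name are hypothetical.

```sql
-- pg_trgm is a contrib extension and has to be enabled once per database.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram index on the hypothetical movies.name column.
CREATE INDEX movies_name_trgm_idx
    ON movies USING gin (name gin_trgm_ops);

-- Substring search anywhere in the name, case-insensitive.
SELECT id, name
FROM movies
WHERE name ILIKE '%uncha%';

-- Similarity search, which also tolerates small typos.
SELECT id, name, similarity(name, 'Djano Unchained') AS score
FROM movies
WHERE name % 'Djano Unchained'
ORDER BY score DESC
LIMIT 10;
```

Very short patterns such as '%Dj%' contain no complete trigram, so the index is of little help for them and they are better handled by a prefix search.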

The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, with millions of rows it will be slow, because a pattern with a leading wildcard cannot use an ordinary index and every row has to be examined.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column: LIKE 'Unchai%') and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers at interpreting strings, so the user may not suffer.
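The two passes could look roughly like this in Postgres. The index name and the LIMIT values are illustrative assumptions; text_pattern_ops is used because a plain B-tree index only serves LIKE prefixes when the database uses the C locale.

```sql
-- B-tree index that supports LIKE 'prefix%' regardless of locale.
CREATE INDEX mytable_name_prefix_idx
    ON mytable (name text_pattern_ops);

-- Pass 1: fast, index-assisted prefix matches, shown to the user first.
SELECT *
FROM mytable
WHERE name LIKE 'Unchai%'
LIMIT 20;

-- Pass 2: slower scan for middle-of-the-string hits,
-- excluding the rows already returned by pass 1.
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%'
  AND name NOT LIKE 'Unchai%'
LIMIT 20;
```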

This question is very much related to the autocomplete in forms. You will find several threads for that.
Basically, you will need a special kind of index, a space-partitioning tree. Postgres has an index type called SP-GiST (built in since version 9.2) which supports such index structures. You will find a bunch of useful stuff if you google for that.
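A small sketch of what that might look like for autocomplete, assuming a hypothetical table movies(id, name) and PostgreSQL 11 or later for the ^@ operator:

```sql
-- SP-GiST index over a text column (stored as a radix tree).
CREATE INDEX movies_name_spgist_idx
    ON movies USING spgist (name);

-- ^@ is the "starts with" operator; the match is case-sensitive.
SELECT id, name
FROM movies
WHERE name ^@ 'Dja'
ORDER BY name
LIMIT 10;
```

This only matches from the start of the whole string, so it would find Django Unchained for Dja but not for Uncha.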

Related

Is it possible to use a full text index to find closest-match strings? What does Statistical Semantics do in full text indexing?

I am looking at SQL Server 2016 full text indexes, and they are great for finding strings that contain multiple words.
When I try to create the full text index, it shows Statistical Semantics as a checkbox. What does statistical semantics do?
Moreover, I want to support "did you mean" queries.
For example, let's say I have a record house, and the user types hause.
Can I use the full text index to find house as the closest match to hause and efficiently show the user "did you mean house"? Thank you.
I have tried SOUNDEX, but the results it generates are terrible.
It returns so many unrelated words.
Since there are so many records in my database and I need very fast results, I need something SQL Server supports natively.
Any ideas? Is there any way to achieve this using indexes?
I know there are multiple algorithms, but they are not efficient enough for me to use online, for example calculating the edit distance against every record. They could be used for offline projects, but I need this to be efficient in an online dictionary where there will be thousands of requests constantly.
I already have a plan in mind: store not-found queries in the database, calculate their closest matches offline, and use those as a cache. However, I wonder whether an online/live solution exists. Consider that there will be over 100M nvarchar records.
Short answer is no, Full Text Search cannot search for words that are similar, but different.
Full Text Search uses stemmers and thesaurus files:
The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").
A Full-Text Search thesaurus defines a set of synonyms for a specific language.
Both stemmers and thesaurus are configurable and you can easily have FT match house for a search on hause, but only if you added hause as a synonym for house. This is obviously a non-solution as it requires you to add every possible typo as a synonym...
Semantic search is a different topic, it allows you to search for documents that are semantically close to a given example.
What you want is to find records that have a short Levenshtein distance from a given word (a.k.a. 'fuzzy' search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exist.

Performance with LIKE vs CONTAINS using full-text indexing

I have a table with a large(ish) number of rows (500k) on MSSQL Server 2008. I have a column which holds an nvarchar product ID which is usually 15 characters in length, alphabetical and numerical, e.g. FF93F348HJKCF5HW9. I would like to be able to search for this product ID with the best performance. I have done some research into using full-text indexing on this column and I don't really think that full-text indexing with CONTAINS offers any benefit over using LIKE '%%'. This looks to be down to the fact that full-text indexing is more beneficial when searching for whole words, rather than a series of characters.
Can somebody confirm/deny this for me?
Full-Text indexing is about searching for language words in unstructured text data. Your data doesn't contain words, just a sequence of characters.
I haven't tested this, but I would expect that LIKE would actually be faster, as long as your data is indexed. CONTAINS is meant for searching for words & word-like structures.
If your requirement is for "auto-complete", then LIKE will perform pretty well since the optimizer will use an INDEX SEEK when you search for something such as LIKE 'F5521%'.
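To illustrate the difference, here is a minimal T-SQL sketch. The table and index names are made up for the example; only the product ID column comes from the question.

```sql
-- Hypothetical table holding the product IDs.
CREATE TABLE dbo.Products (
    Id        int IDENTITY(1,1) PRIMARY KEY,
    ProductId nvarchar(32) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_Products_ProductId
    ON dbo.Products (ProductId);

-- Prefix pattern: the optimizer can turn this into a range seek on the index.
SELECT Id, ProductId
FROM dbo.Products
WHERE ProductId LIKE N'F5521%';

-- Leading wildcard: no seek is possible, so every index row is scanned.
SELECT Id, ProductId
FROM dbo.Products
WHERE ProductId LIKE N'%F5521%';
```

Comparing the actual execution plans of the two queries on realistic data is the quickest way to confirm which one seeks and which one scans.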
This MSDN article explains the basics of the CONTAINS keyword.

Sql Search in millions of records. Possible?

I have a table in my sql server 2005 database which contains about 50 million records.
I have firstName and LastName columns, and I would like to be able to allow the user to search on these columns without it taking forever.
Apart from indexing these columns, is there a way to make my query fast?
Also, I want to search for similar-sounding names. For example, if the user searches for Danny, I would like to return records with the names Dan and Daniel as well. It would be nice to show the user a rank in % of how close each result is to what they actually searched for.
I know this is a tough task, but I bet I'm not the first one in the world to face this issue :)
Thanks for your help.
We have databases with half a billion records (Oracle, but performance should be similar). You can search them within a few milliseconds if you have proper indexes. In your case, place an index on firstname and lastname. A B-tree index will perform well and will scale with the size of your database. Careful: LIKE clauses often prevent the index from being used and badly degrade performance. I know MySQL can keep using indexes with LIKE clauses when wildcards are only at the right of the string. You would have to check whether the same holds for SQL Server.
String similarity is indeed not simple. Have a look at http://en.wikipedia.org/wiki/Category:String_similarity_measures and you'll see some of the possible algorithms. I cannot say whether SQL Server implements any of them, as I don't know this database. Try googling "SQL Server" plus the name of an algorithm to find what you need. Otherwise, Wikipedia provides code in various languages (maybe not SQL, but you should be able to adapt it for a stored procedure).
Have you tried full text indexing? I used it on free text fields in a table of over 1 million records, and found it to be pretty fast. Plus you can add synonyms to it, so that Dan, Daniel, and Danny all index as the same (where you get the dictionary of name equivalents is a different story). It does allow wildcard searches as well. Full text indexing can also do ranking, though I found it to be less useful on names (better for documents).
Enable full text search for this table and those columns; that will create a full text index on them.

should I use LIKE to query tables with 4 million rows

I am designing a search form, and I am wondering whether I should offer searching with LIKE %search_string% on a table that is going to have up to 4 million rows.
In general, I would say no. This is a good candidate for full-text indexing. The leading % in your search string is going to eliminate the possibility of using any indexes.
There may be cases where the wait is acceptable and/or you do not want the additional administrative overhead of maintaining full-text indexes, in which case you might opt for LIKE.
No, you should really only use LIKE '%...%' when your tables are relatively small or you don't care about the performance of your own or other peoples' queries on your database.
There are other ways to achieve this capability which scale much better: full text indexing or, if that's unavailable or not flexible enough, using insert/update triggers to extract non-noise words for querying later.
I mention that last possibility since you may not want a full text index. In other words, do you really care about words like "is", "or" and "but"? (These are the noise words I was alluding to before.)
You can separate the field into words and place the relevant ones in another table and use blindingly fast queries on that table to find the actual rows.
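A sketch of that word-table idea in fairly portable SQL. All table and column names here are hypothetical, and the splitting of titles into lower-cased non-noise words is assumed to happen in a trigger or in application code that is not shown.

```sql
CREATE TABLE items (
    id    integer PRIMARY KEY,
    title varchar(200) NOT NULL
);

-- One row per (word, item); the primary key doubles as the lookup index.
CREATE TABLE item_words (
    word    varchar(50) NOT NULL,
    item_id integer NOT NULL REFERENCES items (id),
    PRIMARY KEY (word, item_id)
);

-- What the trigger or application would store for one title
-- ("the" is dropped as a noise word).
INSERT INTO items (id, title) VALUES (1, 'The Matrix Reloaded');
INSERT INTO item_words (word, item_id) VALUES ('matrix', 1), ('reloaded', 1);

-- Find items that contain both search words.
SELECT i.id, i.title
FROM items i
JOIN item_words w ON w.item_id = i.id
WHERE w.word IN ('matrix', 'reloaded')
GROUP BY i.id, i.title
HAVING COUNT(DISTINCT w.word) = 2;
```

Because the lookup is an equality match on an indexed column, it stays fast as the table grows; a prefix search such as word LIKE 'matr%' can use the same index on most systems.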
The search with LIKE %search_string% is very slow even on indexed columns. In the worst case the search does a full table scan.
If a search LIKE search_string% is enough I'd just provide this possibility.
It depends. Without knowing how responsive the search has to be, it could either be fine or a complete no-go. You'll only really know if you profile your search with likely data patterns and search criteria.
And as RedFilter points out, you might want to consider full text search if plain search isn't performing well.

SQL full text search vs "LIKE"

Let's say I have a fairly simple app that lets users store information on DVDs they own (title, actors, year, description, etc.) and I want to allow users to search their collection by any of these fields (e.g. "Keanu Reeves" or "The Matrix" would be valid search queries).
What's the advantage of going with SQL full text search vs simply splitting the query up by spaces and doing a few "LIKE" clauses in the SQL statement? Does it simply perform better or will it actually return results that are more accurate?
Full text search is likely to be quicker since it will benefit from an index of words that it will use to look up the records, whereas using LIKE is going to need a full table scan.
In some cases LIKE will be more accurate, since LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, returning both would likely have been the better result.
Full-text indexes (which are indexes) are much faster than using LIKE (which essentially examines each row every time). However, if you know the database will be small, there may not be a performance need to use full-text indexes. The only way to determine this is with some intelligent averaging and some testing based on that information.
Accuracy is a different question. Full-text indexing allows you to do several things (weighting, automatically matching eat/eats/eating, etc.) that you couldn't possibly implement in any sort of reasonable time frame using LIKE. The real question is whether you need those features.
Without reading the full-text documentation's description of these features, you're really not going to know how you should proceed. So, read up!
Also, some basic tests (insert a bunch of rows in a table, maybe with some sort of public dictionary as a source of words) will go a long way to helping you decide.
A full text search query is much faster, especially when working with lots of data in various columns.
Additionally you will have language-specific search support. E.g. German umlauts like "ü" in "über" will also be found when stored as "ueber". You can also use synonyms to automatically expand search queries, or to replace or substitute specific phrases.
In some cases LIKE will more accurate since LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded" whereas full text search will ignore "The" and return both. That said both would likely have been a better result.
That is not correct. The full text search syntax lets you specify how you want to search. E.g. with the CONTAINS predicate you can use exact term matching as well as fuzzy matching, weights, etc.
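A few hedged examples of what "how you want to search" can mean in SQL Server. They assume a hypothetical table dbo.Dvds(Id, Title, Description) that already has a full-text index on Title and Description.

```sql
-- Exact phrase: matches "The Matrix" as a phrase, not just either word.
SELECT Id, Title
FROM dbo.Dvds
WHERE CONTAINS(Title, N'"The Matrix"');

-- Prefix term: any word starting with "matri".
SELECT Id, Title
FROM dbo.Dvds
WHERE CONTAINS(Title, N'"matri*"');

-- Inflectional forms: eat, eats, eating, ate ...
SELECT Id, Title
FROM dbo.Dvds
WHERE CONTAINS(Description, N'FORMSOF(INFLECTIONAL, eat)');

-- Looser, meaning-based matching instead of exact terms.
SELECT Id, Title
FROM dbo.Dvds
WHERE FREETEXT(Title, N'matrix reloaded');
```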
So if you have performance issues or would like to provide a more "Google-like" search experience, go for the full text search engine. It is also very easy to configure.
Just a few notes:
LIKE can use an Index Seek if you don't start your LIKE with %. Example: LIKE 'Santa M%' is good! LIKE '%Maria' is bad, and can cause a Table or Index Scan, because this can't be indexed in the standard way.
This is very important: full-text index updates are asynchronous. For instance, if you perform an INSERT on a table followed by a SELECT with full-text search where you expect the new data to appear, you might not get the data immediately. Depending on your configuration, you may have to wait a few seconds or a day. Generally, full-text indexes are populated when your system does not have many requests.
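How quickly new rows show up depends on the change-tracking setting of the full-text index. The statements below are a sketch against a hypothetical table dbo.Dvds that already has a full-text index; which mode is appropriate depends on your workload.

```sql
-- Let SQL Server propagate changes to the full-text index automatically
-- (still in the background, so a short delay remains possible).
ALTER FULLTEXT INDEX ON dbo.Dvds SET CHANGE_TRACKING AUTO;

-- Alternatively, with CHANGE_TRACKING MANUAL, push tracked changes on demand,
-- for example from a scheduled job.
ALTER FULLTEXT INDEX ON dbo.Dvds START UPDATE POPULATION;

-- Check whether a population is currently running (0 means idle).
SELECT OBJECTPROPERTYEX(OBJECT_ID(N'dbo.Dvds'), 'TableFulltextPopulateStatus')
       AS PopulateStatus;
```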
It will perform better, but unless you have a lot of data you won't notice that difference. A SQL full text search index lets you use operators that are more advanced than a simple "LIKE" operation, but if all you do is the equivalent of a LIKE operation against your full text index then your results will be the same.
Imagine that you allow users to enter notes/descriptions on DVDs.
In this case it would be good to allow searching by description.
Full text search will do a better job in this case.
You may get slightly better results, or else at least have an easier implementation with full text indexing. But it depends on how you want it to work ...
What I have in mind is that if you are searching for two words, with LIKE you have to then manually implement (for example) a method to weight those with both higher in the list. A fulltext index should do this for you, and allow you to influence the weightings too using relevant syntax.
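As a sketch of that ranking and weighting in SQL Server, assuming a hypothetical table dbo.Dvds(Id, Title, Description) with a full-text index on both columns. The weights are arbitrary illustration values.

```sql
-- CONTAINSTABLE returns the key of each matching row plus a relevance RANK.
-- ISABOUT lets you weight individual terms (0.0 to 1.0).
SELECT d.Id, d.Title, ft.[RANK]
FROM CONTAINSTABLE(
         dbo.Dvds,
         (Title, Description),
         N'ISABOUT (keanu WEIGHT(0.8), matrix WEIGHT(0.4))'
     ) AS ft
JOIN dbo.Dvds AS d
    ON d.Id = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

The RANK values are only meaningful relative to each other within one query, so they are suited to ordering results rather than to showing an absolute percentage.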
To make full-text search in SQL Server behave more like LIKE:
First, you have to create an empty stoplist and assign it to your table's full-text index, so that noise words are indexed too:
CREATE FULLTEXT STOPLIST [MyStopList];
GO
ALTER FULLTEXT INDEX ON dbo.[MyTableName] SET STOPLIST [MyStopList];
GO
Second, use the following T-SQL script:
-- CONTAINS supports prefix terms only ("term*"); a leading * is not a wildcard.
SELECT * FROM dbo.[MyTableName] AS mt
WHERE CONTAINS((mt.ColumnName1, mt.ColumnName2, mt.ColumnName3), N'"search text s*"');
If you are not only searching for English words, say you search for a Chinese word, then how your FTS tokenizes words will make a big difference to your search, as in the example I gave here: https://stackoverflow.com/a/31396975/301513. But I don't know how SQL Server tokenizes Chinese words. Does it do a good job of that?