How does a full text search server like Sphinx work? - sql

Can anyone explain in simple words how a full text server like Sphinx works? In plain SQL, one would use SQL queries like this to search for certain keywords in texts:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I can not see any queries like this at all. They contain instead SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
It it possible to explain in simple words how these queries work and how they are generated?

Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
Now when you run a sql query through sphinx, it fetches the data from the database and constructs the inverted index which in Sphinx is like a hashtable where the key is a 32 bit integer which is calculated using crc32(word) and the value is the list of documentID's having that word.
This makes it super fast.
Now you can argue that even a database can create a similar structure for making the searches superfast. However the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [From MySQL Performance Blog]. Remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form" but the index will be completely be de-normalized and contain mostly just the data needed to be searched.
Another possible reason is generally databases suffer from internal fragmentation, they need to perform too much semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a databases, the query leads to the indexes which in turn lead to the data. If the data to recover is widely spread, the result will take long and that seems to be what happens in databases.
EDIT: Also please see the source code in cpp files like searchd.cpp etc for the real internal implementation, I think you are just seeing the PHP wrappers.

Those queries you are looking at, are the query sphinx uses, to extract a copy of the data from the database, to put in its own index.
Sphinx needs a copy of the data to build it index (other answers have mentioned how that index works). You then ask for results (matching a specific query) from the searchd daemon - it consults the index and returns you matching documents.
The particular example you have choosen looks quite complicated, because it only extracting a part of the data, probbably for sharding - to split the index into parts for performance reasons. And is using range queries - so can access big datasets piecemeal.
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a sphinx index, with two fields - name and description that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage

Full text search usually use one implementation of inverted index. In simple words, it brakes the content of a indexed field in tokens (words) and save a reference to that row, indexed by each token. For example, a field with The yellow dog for row #1 and The brown fox for row #2, will populate an index like:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1

A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc). Even though they can be used to do keyword search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword search queries, thus can provide much better support.

Related

Prisma and Postgres. Using indexes for optimized 'LIKE' search

I have a usual string column which is not unique and doesn't have any indexes. I've added a search function which uses the Prisma contains under the hood.
Currently it takes ąround 40ms for my queries (with the test database having around 14k records) which could be faster as I understood from this article: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
The issue is that nothing really changes after I add the trigram index, like this
CREATE INDEX CONCURRENTLY trgm_idx_users_name
ON users USING gin (name gin_trgm_ops);
The query execution time is literally the same. I also found that I can check if the index is actually used by disabling the full scan. And I really see that the execution time became times worse after that (meaning the added index is not actually used as it's performance is worse than full scan).
I was trying to use B-tree and Gin indexes.
The query example for testing is just searching records with LIKE:
SELECT *
FROM users
WHERE name LIKE '%test%'
ORDER BY NAME
LIMIT 10;
I couldn't find articles describing best practices for such "LIKE" queries that's why I'm asking here.
So my questions are:
Which index type is suitable for such case - if I need to find N records in string column using LIKE (prisma.io contains). Some docs say that the default B-tree is fine. Some articles show that Gin is better for that purpose.
As I'm using the prisma contains with Postgres, some features are still not supported in Prisma and different workarounds are needed (such as using unsupported types, experimental features etc.). I'd appreciate if Prisma examples were given.
Also, the full text search is not suitable as it requires the knowledge of the language which was used in the column, but my column will contain data in different languages

Searching efficiently with keywords

I'm working with a big table (millions of rows) on a postgresql database, each row has a name column and i would like to perform a search on that column.
For instance, if i'm searching for the movie Django Unchained, i would like the query to return the movie whether i search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked up full text search but i believe it is more intended for long text, my name column will never be more than 4-5 words.
I've thought about having a table keywords with a many to many relationship, but i'm not sure that's the best way to do it.
What would be the most efficient way to query my database ?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that include Postgres code for implementing it.
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, in case you have millions of rows it will be slow because there are no significant efficiencies to be had from indexes.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%';) and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers on interpreting strings, so the user may not suffer.
This question is very much related to the autocomplete in forms. You will find several threads for that.
Basically, you will need a special kind of index, a space partitioning tree. There is an extension called SP-GiST for Postgres which supports such index structures. You will find a bunch of useful stuff if you google for that.

Sql Search in millions of records. Possible?

I have a table in my sql server 2005 database which contains about 50 million records.
I have firstName and LastName columns, and I would like to be able to allow the user to search on these columns without it taking forever.
Out of indexing these columns, is there a way to make my query work fast?
Also, I want to search similar sounded names. for example, if the user searches for Danny, I would like to return records with the name Dan, Daniel as well. It would be nice to show the user a rank in % how close the result he got to what he actually searched.
I know this is a tuff task, but I bet I'm not the first one in the world that face this issue :)
Thanks for your help.
We have databases with half a billion of records (Oracle, but should have similar performances). You can search in it within a few milli seconds if you have proper indexes. In your case, place an index on firstname and lastname. Using binary-tree index will perform good and will scale with the size of your database. Careful, LIKE clauses often break the use of the index and degrades largely the performances. I know MySQL can keep using indexes with LIKE clauses when wildcards are only at the right of the string. You would have to make similar search for SQL Server.
String similarity is indeed not simple. Have a look at http://en.wikipedia.org/wiki/Category:String_similarity_measures, you'll see some of the possible algorithms. Cannot say if SQL Server do implement one of them, dont know this database. Try to Google "SQL Server" + the name of the algorithms to maybe find what you need. Otherwise, you have code provided on Wiki for various languages (maybe not SQL but you should be able to adapt them for a stored procedure).
Have you tried full text indexing? I used it on free text fields in a table over 1 million records, and found it to be pretty fast. Plus you can add synonyms to it, so that Dan, Danial, and Danny all index as the same (where you get the dictionary of name equivalents is a different story). It does allow wildcard searches as well. Full text indexing can also do rank, though I found it to be less useful on names (better for documents).
use FUll TEXT SEARCH enable for this table and those columns, that will create full text index for those columns.

Search in huge table

I got table with over 1 millions rows.
This table represents user information, e.g userName, email, gender, marrial status etc.
I'm going to write search over all rows in this table, when some conditions are applied.
In simples case, when search is perfomed only on userName, it takes over 4-7 seconds to find result.
select from u where u.name ilike " ... "
Yes, i got indexes over some fileds. I checked that they are applied using explain analyse command.
How search can be boost ?
I heart something about Lucene, can it help ?
I'm wondering how does Facebook search working, they got billions users and their search works much faster.
There is great difference between these three queries:
a) SELECT * FROM u WHERE u.name LIKE "George%"
b) SELECT * FROM u WHERE u.name LIKE "%George"
c) SELECT * FROM u WHERE u.name LIKE "%George%"
a) The first will use the index on u.name (if there is one) and will be very fast.
b) The second will not be able to use any index on u.name but there are ways to circumvent that rather easily.
For example, you could add another field nameReversed in the table where REVERSE(name) is stored. With an index on that field, the query will be rewritten as (and will be as fast as the first one):
b2) SELECT * FROM u WHERE u.nameReversed LIKE REVERSE("%George")
c) The third query poses the greatest difficulty as neither of the two previous indexes will be of any help and the query will scan the whole table. Alternatives are:
Using a dedicated for such problems solution (search for "full text search"), like Sphinx. See this question on SO with more details: which-is-best-search-technique-to-search-records
If your field has names only (or another limited set of words, say a few hundred different words), you could create another auxilary table with those names (words) and store only a foreign key in table u.
If off course that is not the case and you have tens of thousands or millions different words or the field contains whole phrases, then to solve the problem with many auxilary tables, it's like creating a full text search tool for yourself. It's a nice exercise and you won't have to use Sphinx (or other) besides the RDBMS but it's not trivial.
Take a look at
Hibernate Search
this is using Lucene but a lot more easier to implement.
Google or Facebook are using different approaches. They have distributed systems. Googles BigTable is a good keyword or the "Map and Reduce" concept (Apache Hadoop) is a good starting point for more research.
Try to use table partitioning.
In large table scenarios can be helpful to partiton a table.
For PostgreSQL try here PostgreSQL Partitioning.
For high scalable fast performance searches, sometimes may be useful to adopt NoSQL database (like Facebook does).
I heart something about Lucene, can it help ?
Yes, it can. I'm sure, you will love it!
I had the same problem: An table with round about 1.2 Million Messages. By searching trough these Messages it needs some seconds. An full text search on the "message" column needs about 10 seconds.
At the same server hardware lucene returns the result in about 200-400ms.
That's very fast.
Cached results returns in round about 5-10 ms.
Lucene is able to connect to your SQL database (for example mysql) - scans your database an builds an searchable index.
For searching this index it depends on the kind of application.
I my case, my PHP Webaplication uses solr for searching inside lucene.
http://lucene.apache.org/solr/

SQL full text search vs "LIKE"

Let's say I have a fairly simple app that lets users store information on DVDs they own (title, actors, year, description, etc.) and I want to allow users to search their collection by any of these fields (e.g. "Keanu Reeves" or "The Matrix" would be valid search queries).
What's the advantage of going with SQL full text search vs simply splitting the query up by spaces and doing a few "LIKE" clauses in the SQL statement? Does it simply perform better or will it actually return results that are more accurate?
Full text search is likely to be quicker since it will benefit from an index of words that it will use to look up the records, whereas using LIKE is going to need to full table scan.
In some cases LIKE will more accurate since LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded" whereas full text search will ignore "The" and return both. That said both would likely have been a better result.
Full-text indexes (which are indexes) are much faster than using LIKE (which essentially examines each row every time). However, if you know the database will be small, there may not be a performance need to use full-text indexes. The only way to determine this is with some intelligent averaging and some testing based on that information.
Accuracy is a different question. Full-text indexing allows you to do several things (weighting, automatically matching eat/eats/eating, etc.) you couldn't possibly implement that in any sort of reasonable time-frame using LIKE. The real question is whether you need those features.
Without reading the full-text documentation's description of these features, you're really not going to know how you should proceed. So, read up!
Also, some basic tests (insert a bunch of rows in a table, maybe with some sort of public dictionary as a source of words) will go a long way to helping you decide.
A full text search query is much faster. Especially when working which lots of data in various columns.
Additionally you will have language specific search support. E.g. german umlauts like "ü" in "über" will also be found when stored as "ueber". Also you can use synonyms where you can automatically expand search queries, or replace or substitute specific phrases.
In some cases LIKE will more accurate
since LIKE "%The%" AND LIKE "%Matrix"
will pick out "The Matrix" but not
"Matrix Reloaded" whereas full text
search will ignore "The" and return
both. That said both would likely have
been a better result.
That is not correct. The full text search syntax lets you specify "how" you want to search. E.g. by using the CONTAINS statement you can use exact term matching as well fuzzy matching, weights etc.
So if you have performance issues or would like to provide a more "Google-like" search experience, go for the full text search engine. It is also very easy to configure.
Just a few notes:
LIKE can use an Index Seek if you don't start your LIKE with %. Example: LIKE 'Santa M%' is good! LIKE '%Maria' is bad! and can cause a Table or Index Scan because this can't be indexed in the standard way.
This is very important. Full-Text Indexes updates are Asynchronous. For instance, if you perform an INSERT on a table followed by a SELECT with Full-Text Search where you expect the new data to appear, you might not get the data immediatly. Based on your configuration, you may have to wait a few seconds or a day. Generally, Full-Text Indexes are populated when your system does not have many requests.
It will perform better, but unless you have a lot of data you won't notice that difference. A SQL full text search index lets you use operators that are more advanced then a simple "LIKE" operation, but if all you do is the equivalent of a LIKE operation against your full text index then your results will be the same.
Imagine if you will allow to enter notes/descriptions on DVDs.
In this case it will be good to allow to search by descriptions.
Full text search in this case will do better job.
You may get slightly better results, or else at least have an easier implementation with full text indexing. But it depends on how you want it to work ...
What I have in mind is that if you are searching for two words, with LIKE you have to then manually implement (for example) a method to weight those with both higher in the list. A fulltext index should do this for you, and allow you to influence the weightings too using relevant syntax.
To FullTextSearch in SQL Server as LIKE
First, You have to create a StopList and assign it to your table
CREATE FULLTEXT STOPLIST [MyStopList];
GO
ALTER FULLTEXT INDEX ON dbo.[MyTableName] SET STOPLIST [MyStopList]
GO
Second, use the following tSql script:
SELECT * FROM dbo.[MyTableName] AS mt
WHERE CONTAINS((mt.ColumnName1,mt.ColumnName2,mt.ColumnName3), N'"*search text s*"')
If you do not just search English word, say you search a Chinese word, then how your fts tokenizes words will make your search a big different, as I gave an example here https://stackoverflow.com/a/31396975/301513. But I don't know how sql server tokenizes Chinese words, does it do a good job for that?