Prisma and Postgres. Using indexes for optimized 'LIKE' search - sql

I have an ordinary string column which is not unique and has no indexes. I've added a search function which uses Prisma's contains under the hood.
Currently my queries take around 40ms (with the test database having around 14k records), which should be possible to make faster, as I understood from this article: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
The issue is that nothing really changes after I add a trigram index like this:
CREATE INDEX CONCURRENTLY trgm_idx_users_name
ON users USING gin (name gin_trgm_ops);
The query execution time is literally the same. I also found that I can check whether the index is actually used by disabling sequential scans, and indeed the execution time became several times worse after that (meaning the added index is not actually used, since its performance is worse than a full scan).
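A minimal way to do that check in psql (enable_seqscan is the standard Postgres setting for it):
SET enable_seqscan = off;

EXPLAIN ANALYZE
SELECT *
FROM users
WHERE name LIKE '%test%'
ORDER BY name
LIMIT 10;

RESET enable_seqscan;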
I have tried both B-tree and GIN indexes.
The query example for testing is just searching records with LIKE:
SELECT *
FROM users
WHERE name LIKE '%test%'
ORDER BY name
LIMIT 10;
I couldn't find articles describing best practices for such "LIKE" queries, which is why I'm asking here.
So my questions are:
Which index type is suitable for this case, where I need to find N records in a string column using LIKE (prisma.io contains)? Some docs say the default B-tree is fine; some articles show that GIN is better for this purpose.
As I'm using Prisma's contains with Postgres: some features are still not supported in Prisma and need workarounds (such as unsupported column types, preview features, etc.), so I'd appreciate Prisma-specific examples.
Also, full text search is not suitable, as it requires knowing the language used in the column, but my column will contain data in different languages.
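For reference, gin_trgm_ops comes from the pg_trgm extension, which must be enabled before the index is created; with Prisma this typically means a hand-edited migration. A minimal sketch (the file location is illustrative):
-- migrations/<timestamp>_add_trgm_index/migration.sql (illustrative path)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- CONCURRENTLY is omitted here because Prisma runs migrations inside a
-- transaction, and CREATE INDEX CONCURRENTLY cannot run in one:
CREATE INDEX IF NOT EXISTS trgm_idx_users_name
ON users USING gin (name gin_trgm_ops);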

Related

SQL query'ish for what I want to achieve in Pouchdb

SELECT * FROM smo_images
WHERE search_term in search_helpers (search_helpers is an array of keywords)
OR search_term = smo_code
OR search_term = size
OR search_term = category;
I would like to achieve something like the above in PouchDB. I am new to NoSQL and PouchDB. The documentation is confusing and not straightforward.
For me, the documentation is quite clear and straightforward. It mentions the old-school SQL method:
Indexes in SQL databases
Quick refresher on how indexes work: in relational databases like MySQL and PostgreSQL, you can usually query whatever field you want:
SELECT * FROM pokemon WHERE name = 'Pikachu';
But if you don't want your performance to be terrible, you first add an index:
ALTER TABLE pokemon ADD INDEX myIndex ON (name);
The job of the index is to ensure the field is stored in a B-tree within the database, so your queries run in O(log(n)) time instead of O(n) time.
From there, it starts a comparison with NoSQL:
Indexes in NoSQL databases
All of the above is also true in document stores like CouchDB and MongoDB, but conceptually it's a little different. By default, documents are assumed to be schemaless blobs with one primary key (called _id in both Mongo and Couch), and any other keys need to be specified separately. The concepts are largely the same; it's mostly just the vocabulary that's different.
In CouchDB, queries are called map/reduce functions. This is because, like most NoSQL databases, CouchDB is designed to scale well across multiple computers, and to perform efficient query operations in parallel. Basically, the idea is that you divide your query into a map function and a reduce function, each of which may be executed in parallel in a multi-node cluster.
It continues with descriptions and sample code on map/reduce functions, temporary/persistent views, and much more.

Searching efficiently with keywords

I'm working with a big table (millions of rows) in a PostgreSQL database; each row has a name column, and I would like to perform a search on that column.
For instance, if I'm searching for the movie Django Unchained, I would like the query to return the movie whether I search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked at full text search, but I believe it is more intended for long text; my name column will never be more than 4-5 words.
I've thought about having a keywords table with a many-to-many relationship, but I'm not sure that's the best way to do it.
What would be the most efficient way to query my database?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of LIKE, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make scanning the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
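A minimal sketch of that layout (table and column names are illustrative):
CREATE TABLE movie_titles (
    id    integer PRIMARY KEY,   -- same id as the main table
    title varchar(255) NOT NULL  -- keep rows as narrow as possible
);

-- The scan is still sequential, but over far fewer pages:
SELECT id FROM movie_titles WHERE title LIKE '%Unchai%';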
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that includes Postgres code for implementing it.
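(For what it's worth, Postgres does ship trigram matching natively as the pg_trgm contrib extension; a minimal sketch against an assumed movies table:)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX trgm_idx_movies_name
ON movies USING gin (name gin_trgm_ops);

-- Unanchored LIKE patterns (3+ characters) can now use the index:
SELECT * FROM movies WHERE name LIKE '%Unchai%';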
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, if you have millions of rows it will be slow, because there are no significant efficiencies to be had from indexes.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column: LIKE 'Unchai%') and then adding middle-of-the-string hits in a second, non-indexed pass (a sketch follows). Humans tend to be significantly slower than computers at interpreting strings, so the user may not suffer.
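A sketch of that two-pass idea (index and table names are illustrative):
-- Pass 1: indexable prefix search (text_pattern_ops makes LIKE 'x%'
-- usable with the index regardless of locale):
CREATE INDEX idx_mytable_name_prefix ON mytable (name text_pattern_ops);

SELECT * FROM mytable WHERE name LIKE 'Unchai%';

-- Pass 2: slower unanchored search, excluding rows already returned:
SELECT * FROM mytable
WHERE name LIKE '%Unchai%' AND name NOT LIKE 'Unchai%';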
This question is very much related to autocomplete in forms; you will find several threads on that.
Basically, you will need a special kind of index: a space-partitioning tree. Postgres supports such index structures natively through the SP-GiST index access method. You will find a bunch of useful material if you google for that.
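A minimal sketch (the default SP-GiST operator class for text supports prefix-style lookups; names are illustrative):
CREATE INDEX spgist_idx_mytable_name ON mytable USING spgist (name);

-- Prefix lookup that the radix tree can serve (^@ is "starts with",
-- available since Postgres 11):
SELECT * FROM mytable WHERE name ^@ 'Djan';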

How does a full text search server like Sphinx work?

Can anyone explain in simple words how a full text search server like Sphinx works? In plain SQL, one would use queries like this to search for certain keywords in texts:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I cannot see any queries like this at all. Instead, they contain SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
Is it possible to explain in simple words how these queries work and how they are generated?
Inverted Index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
Now, when you run an SQL query through Sphinx, it fetches the data from the database and constructs the inverted index, which in Sphinx is like a hashtable where the key is a 32-bit integer calculated with crc32(word) and the value is the list of document IDs containing that word.
This makes it super fast.
Now you can argue that even a database can create a similar structure to make searches super fast. However, the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [from the MySQL Performance Blog]. Remember that an index is usually only there to support search, not to be the primary source of data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed for searching.
Another possible reason is that databases generally suffer from internal fragmentation: they need to perform too many semi-random I/O tasks on huge requests.
That means that, considering the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to recover is widely spread, the result takes long, and that seems to be what happens in databases.
EDIT: Also, please see the source code in the .cpp files (searchd.cpp etc.) for the real internal implementation; I think you are just seeing the PHP wrappers.
The queries you are looking at are the ones Sphinx uses to extract a copy of the data from the database, to put in its own index.
Sphinx needs a copy of the data to build its index (other answers have mentioned how that index works). You then ask the searchd daemon for results matching a specific query; it consults the index and returns the matching documents.
The particular example you have chosen looks quite complicated because it extracts only a part of the data, probably for sharding, i.e. to split the index into parts for performance reasons. It also uses range queries, so it can work through big datasets piecemeal.
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a Sphinx index with two fields, name and description, that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage
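Querying that index then looks roughly like this in SphinxQL, spoken to the searchd daemon over its MySQL-protocol port (the index name is illustrative):
SELECT id FROM items_index WHERE MATCH('keyword');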
Full text search usually uses an implementation of an inverted index. In simple words, it breaks the content of an indexed field into tokens (words) and saves a reference to the row, indexed by each token. For example, with a field containing The yellow dog for row #1 and The brown fox for row #2, the index will be populated like this:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1
A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc.). Even though they can be used for keyword search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword search queries, and thus can provide much better support.

Search in huge table

I have a table with over 1 million rows.
This table holds user information, e.g. userName, email, gender, marital status, etc.
I'm going to write a search over all rows in this table, with various conditions applied.
In the simplest case, when the search is performed only on userName, it takes 4-7 seconds to find a result.
select * from u where u.name ilike '...'
Yes, I have indexes on some fields. I checked that they are applied using the explain analyse command.
How can the search be made faster?
I heard something about Lucene; can it help?
I'm also wondering how Facebook's search works: they have billions of users and their search is much faster.
There is a great difference between these three queries:
a) SELECT * FROM u WHERE u.name LIKE 'George%'
b) SELECT * FROM u WHERE u.name LIKE '%George'
c) SELECT * FROM u WHERE u.name LIKE '%George%'
a) The first will use the index on u.name (if there is one) and will be very fast.
b) The second will not be able to use any index on u.name, but there are ways to circumvent that rather easily.
For example, you could add another field nameReversed to the table, storing REVERSE(name). With an index on that field, the query can be rewritten as follows (and will be as fast as the first one; see the sketch after this list):
b2) SELECT * FROM u WHERE u.nameReversed LIKE REVERSE('%George')
c) The third query poses the greatest difficulty, as neither of the two previous indexes will be of any help and the query will scan the whole table. Alternatives are:
Using a solution dedicated to such problems (search for "full text search"), like Sphinx. See this question on SO for more details: which-is-best-search-technique-to-search-records
If your field holds names only (or another limited set of words, say a few hundred different words), you could create another auxiliary table with those names (words) and store only a foreign key in table u.
If, of course, that is not the case and you have tens of thousands or millions of different words, or the field contains whole phrases, then solving the problem with many auxiliary tables amounts to creating a full text search tool yourself. It's a nice exercise, and you won't have to use Sphinx (or another engine) besides the RDBMS, but it's not trivial.
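A sketch of the reversed-column trick from b), in Postgres syntax (generated columns need Postgres 12+; column and index names are illustrative):
ALTER TABLE u
  ADD COLUMN nameReversed text GENERATED ALWAYS AS (reverse(name)) STORED;

CREATE INDEX idx_u_name_reversed ON u (nameReversed text_pattern_ops);

-- reverse('%George') evaluates to 'egroeG%', so the suffix search becomes
-- an indexable prefix search on the reversed column:
SELECT * FROM u WHERE nameReversed LIKE reverse('%George');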
Take a look at Hibernate Search: it uses Lucene but is much easier to implement.
Google and Facebook use different approaches: they have distributed systems. Google's BigTable is a good keyword, and the "Map and Reduce" concept (Apache Hadoop) is a good starting point for further research.
Try table partitioning.
In large-table scenarios it can be helpful to partition a table.
For PostgreSQL, see PostgreSQL Partitioning.
For highly scalable, fast searches, it can sometimes be useful to adopt a NoSQL database (like Facebook does).
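A minimal sketch of declarative partitioning in PostgreSQL (hash partitioning by id is just one illustrative scheme and needs Postgres 11+):
CREATE TABLE users_part (
    id   bigint NOT NULL,
    name text
) PARTITION BY HASH (id);

CREATE TABLE users_part_0 PARTITION OF users_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_part_1 PARTITION OF users_part
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- (remainders 2 and 3 would be created the same way)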
I heard something about Lucene; can it help?
Yes, it can. I'm sure, you will love it!
I had the same problem: a table with roughly 1.2 million messages. Searching through these messages took several seconds; a full text search on the "message" column needed about 10 seconds.
On the same server hardware, Lucene returns the result in about 200-400 ms.
That's very fast.
Cached results return in roughly 5-10 ms.
Lucene is able to connect to your SQL database (for example MySQL); it scans your database and builds a searchable index.
How you search this index depends on the kind of application.
In my case, my PHP web application uses Solr for searching inside Lucene.
http://lucene.apache.org/solr/

SQL full text search vs "LIKE"

Let's say I have a fairly simple app that lets users store information on DVDs they own (title, actors, year, description, etc.) and I want to allow users to search their collection by any of these fields (e.g. "Keanu Reeves" or "The Matrix" would be valid search queries).
What's the advantage of going with SQL full text search vs simply splitting the query up by spaces and doing a few "LIKE" clauses in the SQL statement? Does it simply perform better or will it actually return results that are more accurate?
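For concreteness, the "split on spaces and AND the LIKEs together" approach would look something like this (table and column names are illustrative):
SELECT * FROM dvds
WHERE actors LIKE '%Keanu%'
  AND actors LIKE '%Reeves%';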
Full text search is likely to be quicker, since it benefits from an index of words that it uses to look up the records, whereas LIKE needs a full table scan.
In some cases LIKE will be more accurate, since LIKE '%The%' AND LIKE '%Matrix' will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, returning both would likely have been the better result.
Full-text indexes (which are indexes) are much faster than using LIKE (which essentially examines each row every time). However, if you know the database will be small, there may not be a performance need to use full-text indexes. The only way to determine this is with some intelligent averaging and some testing based on that information.
Accuracy is a different question. Full-text indexing allows you to do several things (weighting, automatically matching eat/eats/eating, etc.) that you couldn't possibly implement in any sort of reasonable time-frame using LIKE. The real question is whether you need those features.
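For instance, in SQL Server's dialect the eat/eats/eating matching looks roughly like this (table and column names are hypothetical):
-- Match inflectional forms of "eat" via the full-text index:
SELECT * FROM dvds
WHERE CONTAINS(description, 'FORMSOF(INFLECTIONAL, eat)');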
Without reading the full-text documentation's description of these features, you're really not going to know how you should proceed. So, read up!
Also, some basic tests (insert a bunch of rows in a table, maybe with some sort of public dictionary as a source of words) will go a long way to helping you decide.
A full text search query is much faster, especially when working with lots of data in various columns.
Additionally, you get language-specific search support. E.g. German umlauts like "ü" in "über" will also be found when stored as "ueber". You can also use synonyms to automatically expand search queries, or replace or substitute specific phrases.
In some cases LIKE will be more accurate, since LIKE '%The%' AND LIKE '%Matrix' will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, returning both would likely have been the better result.
That is not correct. The full text search syntax lets you specify "how" you want to search. E.g. by using the CONTAINS statement you can use exact term matching as well as fuzzy matching, weights, etc.
So if you have performance issues or would like to provide a more "Google-like" search experience, go for the full text search engine. It is also very easy to configure.
Just a few notes:
LIKE can use an Index Seek if you don't start your pattern with %. Example: LIKE 'Santa M%' is good! LIKE '%Maria' is bad, and can cause a Table or Index Scan, because a leading wildcard can't be indexed in the standard way.
This is very important: Full-Text Index updates are asynchronous. For instance, if you perform an INSERT on a table followed by a SELECT with full-text search where you expect the new data to appear, you might not get the data immediately. Depending on your configuration, you may have to wait a few seconds or a day. Generally, full-text indexes are populated when your system does not have many requests.
It will perform better, but unless you have a lot of data you won't notice the difference. A SQL full text search index lets you use operators that are more advanced than a simple "LIKE" operation, but if all you do is the equivalent of a LIKE operation against your full text index, your results will be the same.
Imagine that you allow users to enter notes/descriptions on the DVDs.
In that case it will be good to allow searching by description, and full text search will do the better job.
You may get slightly better results, or at least have an easier implementation, with full text indexing. But it depends on how you want it to work...
What I have in mind is that if you search for two words, with LIKE you have to manually implement (for example) a method to weight rows containing both words higher in the list. A fulltext index should do this for you, and allow you to influence the weightings too, using the relevant syntax.
To make full text search in SQL Server behave like LIKE:
First, you have to create a stoplist and assign it to your table:
CREATE FULLTEXT STOPLIST [MyStopList];
GO
ALTER FULLTEXT INDEX ON dbo.[MyTableName] SET STOPLIST [MyStopList]
GO
Second, use the following T-SQL script:
SELECT * FROM dbo.[MyTableName] AS mt
WHERE CONTAINS((mt.ColumnName1,mt.ColumnName2,mt.ColumnName3), N'"*search text s*"')
If you are not just searching English words, say you search for a Chinese word, then how your FTS tokenizes words will make a big difference to your search, as in the example I gave here: https://stackoverflow.com/a/31396975/301513. But I don't know how SQL Server tokenizes Chinese words; does it do a good job with that?