SELECT * FROM smo_images
WHERE search_term IN search_helpers (search_helpers is an array of keywords)
OR search_term = smo_code
OR search_term = size
OR search_term = category;
I would like to achieve something like the above in PouchDB. I am new to NoSQL and PouchDB. The documentation is confusing and not straightforward.
For me, the documentation is quite clear and straightforward. It starts with a refresher on the old-school SQL approach:
Indexes in SQL databases
Quick refresher on how indexes work: in
relational databases like MySQL and PostgreSQL, you can usually query
whatever field you want:
SELECT * FROM pokemon WHERE name = 'Pikachu';
But if you don't want your performance to be terrible, you first add
an index:
ALTER TABLE pokemon ADD INDEX myIndex ON (name);
The job of the index is to ensure the field is stored in a B-tree
within the database, so your queries run in O(log(n)) time instead of
O(n) time.
From there, it starts a comparison with NoSQL:
Indexes in NoSQL databases
All of the above is also true in document stores like CouchDB and
MongoDB, but conceptually it's a little different. By default,
documents are assumed to be schemaless blobs with one primary key
(called _id in both Mongo and Couch), and any other keys need to be
specified separately. The concepts are largely the same; it's mostly
just the vocabulary that's different.
In CouchDB, queries are called map/reduce functions. This is because,
like most NoSQL databases, CouchDB is designed to scale well across
multiple computers, and to perform efficient query operations in
parallel. Basically, the idea is that you divide your query into a map
function and a reduce function, each of which may be executed in
parallel in a multi-node cluster.
It continues with descriptions and sample code for map/reduce functions, temporary/persistent views, and more.
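To connect that back to the original smo_images query: one approach (a rough, untested sketch, using the field names from the question) is to emit every searchable key from a map function and then query it by the search term. In practice you would put the map function in a design document as a persistent view rather than running it as a temporary view:
var db = new PouchDB('smo_images');
var searchTerm = 'beach'; // whatever the user typed

// map function: emit one row per searchable key of each image document
function searchKeys(doc) {
  (doc.search_helpers || []).forEach(function (helper) {
    emit(helper);
  });
  emit(doc.smo_code);
  emit(doc.size);
  emit(doc.category);
}

// look up every doc where any of those keys equals the search term
db.query(searchKeys, { key: searchTerm, include_docs: true })
  .then(function (result) {
    var matches = result.rows.map(function (row) { return row.doc; });
    // note: a doc can appear more than once if the term matches several keys
  });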
Related
I have an ordinary string column which is not unique and doesn't have any indexes. I've added a search function which uses Prisma's contains under the hood.
Currently it takes around 40ms for my queries (with the test database having around 14k records), which could be faster, as I understood from this article: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/
The issue is that nothing really changes after I add the trigram index, like this:
CREATE INDEX CONCURRENTLY trgm_idx_users_name
ON users USING gin (name gin_trgm_ops);
The query execution time is literally the same. I also found that I can check whether the index is actually used by disabling the full scan, and the execution time became several times worse after that (meaning the added index is not actually used, since its performance is worse than a full scan).
I have tried both B-tree and GIN indexes.
The query example for testing is just searching records with LIKE:
SELECT *
FROM users
WHERE name LIKE '%test%'
ORDER BY name
LIMIT 10;
I couldn't find articles describing best practices for such "LIKE" queries, which is why I'm asking here.
So my questions are:
Which index type is suitable for this case, i.e. finding N records in a string column using LIKE (prisma.io contains)? Some docs say the default B-tree is fine; some articles show that GIN is better for this purpose.
As I'm using Prisma's contains with Postgres, some features are still not supported in Prisma and workarounds are needed (such as using unsupported types, experimental features, etc.). I'd appreciate it if Prisma examples were given.
Also, full-text search is not suitable, as it requires knowing the language used in the column, but my column will contain data in different languages.
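For reference, the Prisma call behind that query presumably looks something like the sketch below (the User model and name field are illustrative; contains translates to a LIKE '%term%' filter on Postgres, or ILIKE with mode: 'insensitive'). The pg_trgm extension and the GIN index above would still be created outside this call, e.g. in a raw SQL migration:
import { PrismaClient } from '@prisma/client'

const prisma = new PrismaClient()

// roughly the query from above:
//   SELECT * FROM users WHERE name LIKE '%test%' ORDER BY name LIMIT 10;
const users = await prisma.user.findMany({
  where: { name: { contains: 'test' } }, // add mode: 'insensitive' for ILIKE
  orderBy: { name: 'asc' },
  take: 10,
})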
While we believe that NoSQL databases have come to fill a number of gaps that are challenging on the RDBMS side, I have had several challenges over time with NoSQL databases in the area of their query ecosystem.
Couchbase, for example, like its parent CouchDB, has made major improvements in reading data using views, lists, key lookups, map/reduce, etc. Couchbase has even created an SQL-like query engine for its 2.x version. MongoDB has also made serious improvements, complex queries are possible on it, and there are many other NoSQL developments going on out there.
Most NoSQL databases can perform complex queries based on logical and comparison operators (AND, OR, ==, etc.). However, aggregation and complex relations on data are a problem for me. For example, in CouchDB and/or Couchbase, views span only a single database; it is not possible to write a view which will aggregate data from two or more databases.

Let me now get to the problem: functions (whether aggregate or not) such as AVG, SUM, ROUND, TRUNC, MAX, MIN, etc. The lack of data types makes it impossible to work efficiently with dates and times, hence the lack of date and time functions such as TO_DATE, SYSDATE (for the system date/time), ADD_MONTHS, DATE BETWEEN, date/time format conversion, etc. It is true that many will say they lack schemas, types and so on, but I have found myself repeatedly needing at least one of the functions listed above. For example, because NoSQL databases have no date/time data type, it is hard to perform queries based on them, yet you might want to analyse trends over time. Others have tried to use UNIX/epoch timestamps and the like to solve this, but it isn't a one-size-fits-all solution.

Map/reduce can be used to attain aggregation to a certain (small) degree, but the overhead has turned out to be great, and the lack of GROUP BY functionality makes it a strenuous way to filter down to what you want. Look at the query below:
SELECT
doc.field1, doc.field3, SUM(doc.field2 + doc.field4)
FROM
couchdb.my_database
GROUP BY doc.field1, doc.field3
HAVING SUM(doc.field2 + doc.field4) > 20000;
This is not very easy to attain in CouchDB or Couchbase, and I am not sure whether it's possible in MongoDB. I wish it were possible out of the box. This has made it difficult to use NoSQL as a data warehouse or OLTP/OLAP solution. I found that each time a complex analysis needs to be made, it has to be done in the middleware by paging through different datasets. Now, some experienced vendors (e.g. Cloudant) have tweaked Lucene to perform complex queries, but because it was initially meant for indexing and text search, it has not solved the lack of functions and data aggregation in most NoSQL databases.
Because of the lack of functions, most NoSQL databases have a NULL data type but lack the option of converting NULL values to something else, as some RDBMSs allow. For example, in Oracle I could use NVL(column, 0) to include all rows while performing, say, an AVG calculation on a given column (since, by default, NULL values are not counted/included in the calculation).
To fully understand the problem: CouchDB views, for example, operate within the scope of a single doc, like this:
function(doc){
  // if statements, logical operators, comparison operators,
  // etc. go here, until you emit the doc (or parts of it)
  // if it satisfies the conditions set, e.g.
  // emit(null, doc) or emit(doc.x, [doc.y, doc.z]) etc.
  // you can only emit JavaScript data types anyway
  emit(doc.field1, doc);
}
The docs that satisfy the filters are let through and go on to the next stage or to a reduce function. Imagine a doc structure like the one below:
{
x: '',
y: '',
z: {
p: '',
n: N // integer or number data type
},
date: 'DD/MON/YYYY' // date format
}
Now, let's imagine the possibility of this kind of query:
function(){
var average = select AVG(doc.z.n) from couchdb.my_database;
var Result = select doc.x,doc.y from couchdb.my_database where
doc.z.n > average and doc.y = 'some string' and
doc.date between '01-JUN-2012' and '03-AUG-2012';
emit(Result);
}
Or if this query were possible:
function(){
var latest = select MAX(doc.date) from couchdb.my_database;
var Result = select
doc.x,doc.z.p,MONTHS_BETWEEN(doc.date,latest) as "Months_interval"
from couchdb.my_database where doc.y like '%john%'
order by doc.z.p;
emit(Result);
}
Qn 1: Which NoSQL database solution has attained, to a great degree, the query capability described above? What key features make it stand out?
Qn 2: Is the lack of a schema, or the key-value nature, a reason for the lack of functions when querying these databases? What is the reason for the lack of aggregate functionality in most NoSQL databases?
Qn 3: If the query capability above is possible in any NoSQL database, show how the last two query problems above can be solved using the existing NoSQL infrastructure (consider any NoSQL technology of your choice).
MongoDB has something called the Aggregation Framework, and it works pretty well. I would say that almost every SQL aggregation query can be carried out with this framework, and the MongoDB documentation includes examples of "converting" SQL to the Aggregation Framework.
Anyway, MongoDB is a document-oriented database and not key-value like CouchDB, so I don't know if it fits your requirements.
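For instance, the GROUP BY/HAVING example from the question could be expressed as a pipeline roughly like this (an untested sketch; collection and field names follow the question, and $ifNull plays the role of the Oracle NVL mentioned there):
db.my_database.aggregate([
  {
    // GROUP BY field1, field3 with SUM(field2 + field4) per group
    $group: {
      _id: { field1: "$field1", field3: "$field3" },
      total: {
        $sum: { $add: [ { $ifNull: ["$field2", 0] }, { $ifNull: ["$field4", 0] } ] }
      }
    }
  },
  // HAVING SUM(...) > 20000
  { $match: { total: { $gt: 20000 } } }
]);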
If I have a table of, say, blog posts, with columns such as post_id and author_id, and I used the SQL "SELECT * FROM post_table where author_id = 34", what would be the computational complexity of that query? Would it simply look through each row and check if it has the correct author id, O(n), or does it do something more efficient?
I'm asking because I'm in a situation where I could either search an SQL database with this data, or load an XML file with a list of posts and search through those, and I was wondering which would be faster.
There are two basic ways that such a simple query would be executed.
The first is to do a full table scan. This would have O(n) performance.
The second is to look up the value in an index, then load the page, and return the results. The index scan should be O(log(n)). Loading the page should be O(1).
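As a toy illustration of the two paths (nothing like a real engine's B-tree, but it shows why the complexities differ):
// full table scan: O(n) - check every row for the author
function scanByAuthor(rows, authorId) {
  return rows.filter(function (row) { return row.author_id === authorId; });
}

// index lookup: O(log n) binary search over an index sorted by author_id;
// each index entry points back at the matching row ids
function indexLookup(sortedIndex, authorId) {
  var lo = 0, hi = sortedIndex.length - 1;
  while (lo <= hi) {
    var mid = (lo + hi) >> 1;
    if (sortedIndex[mid].author_id === authorId) return sortedIndex[mid].rowIds;
    if (sortedIndex[mid].author_id < authorId) lo = mid + 1;
    else hi = mid - 1;
  }
  return [];
}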
With a more complicated query, it would be hard to make such a general statement. But any SQL engine is generally going to take one of these two paths. Oh, there is a third option if the table is partitioned on author_id, but you are probably not interested in that.
That said, the power of a database is not in these details. It is in the management of memory. The database will cache the data and index in memory, so you do not have to re-read data pages. The database will take advantage of multiple processors and multiple disks, so you do not have to code this. The database keeps everything consistent, in the face of updates and deletes.
As for your specific question: if the data is in the database, search it there. Loading all the data into an XML file and then doing the search in memory requires a lot of overhead. You would only want to do that if the connection to your database is slow and you are doing many such queries.
Have a look at the EXPLAIN command. It shows you what the database actually does when executing a given SELECT query.
Can anyone explain in simple words how a full-text search server like Sphinx works? In plain SQL, one would use queries like this to search for certain keywords in text:
select * from items where name like '%keyword%';
But in the configuration files generated by various Sphinx plugins I cannot see any queries like this at all. Instead, they contain SQL statements like the following, which seem to divide the search into distinct ID groups:
SELECT (items.id * 5 + 1) AS id, ...
WHERE items.id >= $start AND items.id <= $end
GROUP BY items.id
..
SELECT * FROM items WHERE items.id = (($id - 1) / 5)
Is it possible to explain in simple words how these queries work and how they are generated?
The inverted index is the answer to your question: http://en.wikipedia.org/wiki/Inverted_index
Now, when you run an SQL query through Sphinx, it fetches the data from the database and constructs an inverted index, which in Sphinx is like a hash table where the key is a 32-bit integer calculated using crc32(word) and the value is the list of document IDs containing that word.
This makes it super fast.
Now you could argue that even a database can create a similar structure to make searches super fast. However, the biggest difference is that a Sphinx/Lucene/Solr index is like a single-table database without any support for relational queries (JOINs) [from the MySQL Performance Blog]. Remember that an index is usually only there to support search, not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely denormalized and contain mostly just the data needed to be searched.
Another possible reason is that databases generally suffer from internal fragmentation: they need to perform too many semi-random I/O tasks on huge requests.
What that means is, for example, considering the index architecture of a database, the query leads to the indexes, which in turn lead to the data. If the data to retrieve is widely scattered, the result will take a long time, and that seems to be what happens in databases.
EDIT: Also, please see the source code in the .cpp files such as searchd.cpp for the real internal implementation; I think you are just looking at the PHP wrappers.
The queries you are looking at are the queries Sphinx uses to extract a copy of the data from the database to put into its own index.
Sphinx needs a copy of the data to build its index (other answers have mentioned how that index works). You then ask for results (matching a specific query) from the searchd daemon: it consults the index and returns the matching documents.
The particular example you have chosen looks quite complicated because it is only extracting part of the data, probably for sharding, i.e. splitting the index into parts for performance reasons, and it is using range queries so it can work through big datasets piecemeal. Multiplying id by 5 and adding an offset interleaves documents from up to five sources into one ID space, and (($id - 1) / 5) maps a Sphinx document ID back to the original row.
An index could be built with a much simpler query, like
sql_query = select id,name,description from items
which would create a Sphinx index with two fields, name and description, that could be searched/queried.
When searching, you would get back the unique id. http://sphinxsearch.com/info/faq/#row-storage
Full-text search engines usually use some implementation of an inverted index. In simple words, the engine breaks the content of an indexed field into tokens (words) and saves a reference to that row, indexed by each token. For example, a field containing "The yellow dog" for row #1 and "The brown fox" for row #2 will populate an index like:
brown -> row#2
dog -> row#1
fox -> row#2
The -> row#1
The -> row#2
yellow -> row#1
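As a rough sketch of that structure (illustrative only, not how Sphinx actually stores it; tokens are lower-cased here for simplicity):
// toy inverted index: token -> list of row ids
function buildInvertedIndex(rows) {
  var index = {};
  rows.forEach(function (row) {
    row.text.toLowerCase().split(/\s+/).forEach(function (token) {
      if (!index[token]) index[token] = [];
      if (index[token].indexOf(row.id) === -1) index[token].push(row.id);
    });
  });
  return index;
}

var index = buildInvertedIndex([
  { id: 1, text: 'The yellow dog' },
  { id: 2, text: 'The brown fox' }
]);
// index.dog   -> [1]
// index.brown -> [2]
// index.the   -> [1, 2]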
A short answer to the question is that databases such as MySQL are specifically designed for storing and indexing records and supporting SQL clauses (SELECT, PROJECT, JOIN, etc.). Even though they can be used for keyword-search queries, they cannot give the best performance and features. Search engines such as Sphinx are designed specifically for keyword-search queries and thus can provide much better support.
I've been doing the relational database thing for years now, but lately have moved into Cassandra/Redis territory. NoSQL makes sense for what we're doing, so that's fine.
As I was working through defining Cassandra column families today, a question occurred to me: in relational databases, why doesn't DDL let us define denormalization rules in such a way that the database engine itself can manage the resulting consistency issues natively? In other words, when a relational database programmer denormalizes to achieve performance goals... why is he/she then left to maintain consistency via purpose-written SQL?
Maybe there's something obvious that I'm missing? Is there some reason why such a suggestion is silly, because it seems to me like having this capability might be awfully useful.
EDIT:
Appreciate the feedback so far. I still feel like I have an unanswered (perhaps because it's been poorly articulated) question on my hands. I understand that materialized views attempt to offer engine-managed consistency for denormalized data. However, my understanding is that they aren't updated immediately with changes to the underlying tables. If this is true, it means the engine really isn't managing the consistency issues resulting from the denormalization... at least not at write-time. What I'm getting at is that a normalized data structure without true, feature-rich, engine-managed denormalization hamstrings relational database engines when it comes time to scale a system with heavy read load against complex relational models. I suppose it's true that adjusting materialized view refresh rates equates to tunable "eventual consistency" offered by NoSQL engines like Cassandra. I need to read up on how efficiently engines are able to sync their materialized views. In order to be considered viable relative to NoSQL options, the time it takes to sync a view would need to increase linearly with the number of added/updated rows.
Anyway, I'll think about this some more and re-edit. Hopefully with some representative examples of imagined DDL.
Some relational database systems are able to maintain consistency of denormalized data to some extent (if I understand correctly what you mean).
In Oracle these are called materialized views; in SQL Server, indexed views.
Basically, this means that you can create a self-maintained denormalized table as the result of an SQL query and index it:
CREATE VIEW a_b
WITH SCHEMABINDING
AS
SELECT b.id AS id, b.value, b.a_id, a.property
FROM dbo.b b
JOIN dbo.a a
ON a.id = b.a_id;

-- SQL Server materializes the view once it gets a unique clustered index:
CREATE UNIQUE CLUSTERED INDEX ux_a_b ON a_b (id);
The resulting view, a_b, were it a real table, would violate 2NF, since property is functionally dependent on a_id, which is not a candidate key. However, the database system maintains this functional dependency, and you can create a composite index on, say, (value, property).
Even MySQL and PostgreSQL which don't support materialized views natively are capable of maintaining some kind of denormalized tables.
For instance, when you create a FULLTEXT index on a column or a set of columns in MySQL, you get two indexes at once: the first contains one entry for each distinct word in each record (with a reference to the original record id); the second contains one record per distinct word in the whole table, with the total word count. This allows searching for words quickly and ordering by relevance.
The total word count table is of course dependent on the individual words table and hence violates 5NF, but, again, the system maintains this dependency.
Similar things are done for GIN and GiST indexes in PostgreSQL.
Of course, not all possible denormalizations can be maintained, which means that you cannot materialize and index just any query in real time: some are too expensive to maintain, some are theoretically possible but not implemented in actual systems, etc.
However, you can maintain them using your own logic in triggers, stored procedures or whatever; that's exactly what those are there for.
Denormalisation in an RDBMS is a special case, not the standard. You only do it when you have a proven need; if you design in denormalised data up front, you've already lost.
Given that each case is by definition "special", how can there be standard SQL constructs to maintain the denormalised data?
An RDBMS differs from NoSQL in that it is designed to work with normalised designs. IMHO, you can't compare RDBMS and NoSQL like this.