Would MySQL with sharding (Vitess) be faster than any NoSQL database?

I have been trying to understand why NoSQL databases are considered faster than an RDBMS. I understand that NoSQL databases don't follow the ACID properties but instead follow the BASE principle, which is the reason NoSQL is able to scale horizontally.
What I am trying to understand is whether there is a big difference in speed for similar queries.
For instance, let's say we have a search that finds all the matching users in our database. We have the same data in MySQL as well as in some NoSQL database, and let's assume that the NoSQL database has just one shard. Would there still be a difference in the speed of the query, or would it be the same?

With disks getting faster and cheaper, the speed advantage may not be significant enough to make a difference.
However, you can do a lot more with systems like Vitess in terms of transactions, joins and indexes. At the same time, you can continue to scale like NoSQL systems. For this reason, I think Vitess is an overall better trade-off.

Related

Why are Relational databases said to be not good at scalability and what gives NOSQL databases the edge here?

Many articles claim that relational databases cannot be scaled and NOSQL is better at it but do not explain why. Scalability is often projected as an advantage of NOSQL. What is the problem with scaling relational databases? What makes NOSQL databases superior to relational databases in the aspect of scalability?
Both SQL and NOSQL databases can scale. However, NOSQL databases have some simplified functionality that can improve scalability.
For instance, SQL databases generally enforce a set of properties called ACID properties. These ensure the consistency of the data over time and the ability to implement an entire transaction "all at once".
However, when running in a multi-processor environment, there is overhead to strictly maintaining the ACID properties. Basically, the data needs to look the same from any processor at the same time.
NOSQL databases often implement "ACID-lite". For instance, they offer "eventual-consistency". This means that for a few seconds or minutes, a query might return different values depending on which processor(s) process it. And, this is fine for many applications.
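To make "eventual consistency" concrete, here is a toy sketch in Python. The `Replica` and `EventuallyConsistentStore` classes are made up for illustration and do not correspond to any real database's API. A write is acknowledged once one replica has it, and the other replicas only catch up when `sync()` runs.

```python
class Replica:
    """One node holding its own copy of the data (hypothetical)."""

    def __init__(self):
        self.data = {}
        self.pending = []  # replicated writes that have not been applied yet


class EventuallyConsistentStore:
    """Toy store: writes land on one replica first and reach the rest later."""

    def __init__(self, replica_count=2):
        self.replicas = [Replica() for _ in range(replica_count)]

    def write(self, key, value):
        # Acknowledge the write as soon as the first replica has it.
        primary, *others = self.replicas
        primary.data[key] = value
        for replica in others:
            replica.pending.append((key, value))

    def read(self, key, replica_index):
        # A read is served by whichever replica the caller happens to reach.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Stand-in for background replication: apply queued writes in order.
        for replica in self.replicas:
            for key, value in replica.pending:
                replica.data[key] = value
            replica.pending.clear()


store = EventuallyConsistentStore()
store.write("user:1", "alice")
stale = store.read("user:1", replica_index=1)      # replica 1 has not caught up
fresh = store.read("user:1", replica_index=0)      # replica 0 took the write
store.sync()
converged = store.read("user:1", replica_index=1)  # now both replicas agree
```

Between the write and `sync()`, the two replicas give different answers to the same query, which is exactly the window described above. A strictly ACID system would have to pay coordination cost to close that window before acknowledging the write.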
This truly depends on the long-term requirements of the enterprise and the volume of data expected. Another key factor is whether you mainly need an OLTP scenario with little reporting, which calls for ACID guarantees. NoSQL is usually a better fit than SQL where reporting is vital. Both have their own merits, but ideally a hybrid model that takes advantage of both usually works better: you get scalability and better transaction control from SQL databases, and high-performance reporting from NoSQL databases, which allow every level of freedom, such as graph databases and key-value stores. There are lots of interesting comparisons available, even for the specific databases you want to evaluate.
Puneet

How can NoSQL databases achieve much better write throughput than some relational databases?

How is this possible? What is it about NoSQL that gives it a higher write throughput than some RDBMS? Does it boil down to scalability?
Some NoSQL systems are basically just persistent key/value stores (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will be (or at least should be) faster than an RDBMS, because it only needs to have a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB). These databases have no predefined data structure. Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
There are many NoSQL solutions around, each one with its own strengths and weaknesses, so the following must be taken with a grain of salt.
But essentially, what many NoSQL databases do is rely on denormalization and try to optimize for the denormalized case. For instance, say you are reading a blog post together with its comments in a document-oriented database. Often, the comments will be saved together with the post itself. This means that it will be faster to retrieve all of them together, as they are stored in the same place and you do not have to perform a join.
Of course, you can do the same in SQL, and denormalizing is a common practice when one needs performance. It is just that many NoSQL solutions are engineered from the start to be always used this way. You then get the usual tradeoffs: for instance, adding a comment in the above example will be slower because you have to save the whole document with it. And once you have denormalized, you have to take care of preserving data integrity in your application.
Moreover, in many NoSQL solutions, it is impossible to do arbitrary joins, hence arbitrary queries. Some databases, like CouchDB, require you to think ahead of the queries you will need and prepare them inside the DB.
All in all, it boils down to expecting a denormalized schema and optimizing reads for that situation, and this works well for data that is not highly relational and that requires much more reads than writes.
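The blog post example above can be sketched in Python. This is only an illustration under assumptions: SQLite stands in for the relational side, and a plain dict of JSON strings (`documents`) stands in for a document store. The table names, keys and the `add_comment` helper are all made up.

```python
import json
import sqlite3

# Normalized layout: posts and comments live in separate tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
    INSERT INTO posts VALUES (1, 'Hello');
    INSERT INTO comments VALUES (1, 1, 'First!'), (2, 1, 'Nice post');
""")

# Reading a post with its comments requires a join.
rows = conn.execute("""
    SELECT p.title, c.body
    FROM posts p JOIN comments c ON c.post_id = p.id
    WHERE p.id = ?
    ORDER BY c.id
""", (1,)).fetchall()
normalized = {"title": rows[0][0], "comments": [body for _, body in rows]}

# Denormalized layout: the post and its comments are one document under one key.
documents = {
    "post:1": json.dumps({"title": "Hello", "comments": ["First!", "Nice post"]}),
}
denormalized = json.loads(documents["post:1"])  # a single lookup, no join


def add_comment(key, body):
    """The trade-off: adding one comment rewrites the whole document."""
    doc = json.loads(documents[key])
    doc["comments"].append(body)
    documents[key] = json.dumps(doc)
```

Both reads produce the same structure, but the second one is a single key lookup. The cost shows up in `add_comment`, which has to load and save the entire document, and in the fact that nothing but application code keeps the embedded comments consistent.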
This link explains a lot. In short:
RDBMS -> data integrity is a key feature (which can slow down some operations, like writing)
NoSQL -> speed and horizontal scalability are imperative (so speed is really high with this imperative)
And the thing about NoSQL is that it cannot be compared to SQL as if it were a single thing. NoSQL is the name for all persistence technologies that are not SQL. Document DBs, key-value DBs and event DBs are all NoSQL. They differ in almost every aspect, be it the structure of the saved data, querying, performance or available tools.
Hope this helps.
In summary, NoSQL databases are built to easily scale across a large number of servers (by sharding/horizontal partitioning of data items), and to be fault tolerant (through replication, write-ahead logging, and data repair mechanisms). Furthermore, NoSQL supports achieving high write throughput (by employing memory caches and append-only storage semantics), low read latencies (through caching and smart storage data models), and flexibility (with schema-less design and denormalization).
From:
Open Journal of Databases (OJDB)
Volume 1, Issue 2, 2014
www.ronpub.com/journals/ojdb
ISSN 2199-3459
https://estudogeral.sib.uc.pt/bitstream/10316/27748/1/Which%20NoSQL%20Database.pdf - page 19
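The sharding (horizontal partitioning) mentioned in the quoted summary can be illustrated with a minimal hash-based sketch in Python. `shard_for` and `ShardedStore` are hypothetical names, each shard is just an in-memory dict, and this is not the partitioning scheme of any particular product.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy key/value store that spreads keys over several shards."""

    def __init__(self, shard_count):
        # In a real deployment each shard would be a separate server.
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # Only the one shard that owns the key is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(100):
    store.put(f"user:{i}", i)
```

Each key lookup touches exactly one shard, which is why key-based access scales well when shards are added. A query that cannot be routed by key, such as "find all matching users", has to visit every shard. Note also that plain modulo hashing moves most keys when `shard_count` changes, so real systems tend to use more elaborate schemes.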
A higher write throughput can also be credited to the internal data structures that power the database storage engine.
Even though B-tree implementations used by some RDBMS have stood the test of time, LSM-trees used in some key-value datastores are typically faster for writes:
1: When a write comes, you add it to the in-memory balanced tree, called memtable.
2: When the memtable grows big, it is flushed to the disk.
To understand this data structure better, please check this video and this answer.
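The two write steps above can be sketched as a toy Python class. This is a simplification under stated assumptions: `TinyLSM` is a made-up name, a plain dict that is sorted at flush time stands in for the balanced tree, the "disk" is an in-memory list of sorted runs, and compaction is left out entirely.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: a memtable plus flushed sorted runs."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.memtable_limit = memtable_limit
        self.sstables = []  # flushed runs, oldest first, each sorted by key

    def put(self, key, value):
        # Step 1: a write only touches the in-memory memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Step 2: a full memtable is written out as one sorted, immutable run.
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then the runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)  # memtable is full, so it is flushed as one sorted run
db.put("a", 3)  # the newer value for "a" lives in the fresh memtable
```

Writes never modify existing runs, so on a real disk they become sequential appends instead of in-place page updates, which is where the write advantage comes from. The price is paid on reads, which may have to look in several runs.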

access sql database as nosql (couchbase)

I would like to access a SQL database the NoSQL way, as key-value pairs or documents.
This is for a future upgrade: if the number of users increases a lot,
I could migrate from SQL to NoSQL immediately without changing any application code.
Of course I could write the API/solution myself; I just wonder whether anyone has already done this and published the solution.
Your comments are welcome.
While I agree with everything scalabilitysolved has said, there is an interesting feature in the offing for Postgres, scheduled for the 9.4 Postgres release, namely, jsonb: http://www.postgresql.org/docs/devel/static/datatype-json.html with some interesting indexing and query possibilities. I mention this as you tagged MongoDB and Couchbase, both of which use JSON (well, technically BSON in MongoDB's case).
Of course, querying, sharding, replication, ACID guarantees, etc. will still be totally different between Postgres (or any other traditional RDBMS) and any document-based NoSQL solution, and migrations between any two RDBMSs tend to be quite painful, let alone between an RDBMS and a NoSQL data store. However, jsonb looks quite promising as a potential half-way house between two of the major paradigms of data storage.
On the other hand, every release of MongoDB brings enhancements to the aggregation pipeline, which is certainly something that seems to appeal to people used to the flexibility that SQL offers and less "alien" than distributed map/reduce jobs. So, it seems reasonable to conclude that there will continue to be cross pollination.
See Explanation of JSONB introduced by PostgreSQL for some further insights into jsonb.
No no no, do not consider this, it's a really bad idea. Pick either an RDBMS or a NoSQL solution based upon how your data is modelled and your usage patterns. Converting from one to the other is going to be painful, especially if your 'user amount increases a lot'.
Let's face it, either approach would deal with a large increase in usage, and both would benefit more from specific optimizations to their database than from simply swapping because one 'scales more'.
If your data model fits an RDBMS and it needs to perform better, then analyze your queries, check that your indexes are optimized, and look into caching and better data access patterns.
If your data model fits a NoSQL database, then as your dataset grows you can add additional nodes (Couchbase), cache expensive map/reduce jobs and, again, optimize your data access patterns.
In summary, pick either SQL or NoSQL depending on your data needs. Don't just assume that NoSQL is a magic bullet, as with easier scaling comes a much less flexible querying model.

SQL Database Query Performance Chart on Large Datasets?

What are the best docs/articles out there that show how different queries perform on large data sets? I'm trying to get a feel for which approaches are better than others when building a dashboard where the number and type of queries easily become large, complex and slow.
I suggest you take the time to get a book on performance tuning for the database backend you are using. Performance tuning is very database specific, and techniques that are fast on one database may be dog slow on another (cursors on Oracle and SQL Server come to mind). If you are going to be writing complex queries, you need to understand performance tuning in depth before trying to write them. I will give you a hint: joins are usually faster than correlated subqueries, which developers often use because they seem more straightforward.
It is not easy to compare the actual performance of one database architecture against another. I think the most common things you'll find are benchmarks, which will vary, or theoretical proposals.
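To illustrate the hint about joins and correlated subqueries, here is a small sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables are made up, and the example only shows that the two forms return the same result. Which one is actually faster depends on the database engine and its optimizer, so measure on your own backend.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# Correlated subquery: the inner SELECT is evaluated per customer row.
correlated = conn.execute("""
    SELECT c.name,
           (SELECT COALESCE(SUM(o.total), 0)
            FROM orders o
            WHERE o.customer_id = c.id)
    FROM customers c
    ORDER BY c.id
""").fetchall()

# Join form: one pass over the joined rows, aggregated per customer.
joined = conn.execute("""
    SELECT c.name, COALESCE(SUM(o.total), 0)
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()
```

The `LEFT JOIN` keeps customers without orders, so both queries report a total of 0 for them. On a large data set, comparing the query plans of the two forms is a good first step before rewriting anything.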
It's certainly harder to compare SQL vs NoSQL because you need a lot of data in the first place to have NoSQL be beneficial.
Minimizing joins, indexing properly, and having enough memory/cpu availability are going to be the three dominating factors.

SimpleDB as Denormalized DB

In an environment where you have a relational database which handles all business transactions is it a good idea to utilise SimpleDB for all data queries to have faster and more lightweight search?
So the master data storage would be a relational DB which is "replicated"/"transformed" into SimpleDB to provide very fast read only queries since no JOINS and complicated subselects are needed.
What you're considering smells of premature optimization ...
Have you benchmarked your application? Have you identified your search queries as a performance bottleneck? Have you correctly implemented indexes into your database?
IF (and that's a big if) there's no way using a relational database to offer decent search times to your users, going NOSQL might be something worth considering ... but not before !
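As a rough sketch of how to check whether a search query is using an index, the following Python snippet uses the built-in `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`. The `users` table, the `idx_users_email` index and the `query_plan` helper are made up for illustration. Other databases have their own equivalent of `EXPLAIN`, and the exact plan text varies by version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users (name, email) VALUES (?, ?)",
    [(f"user{i}", f"user{i}@example.com") for i in range(1000)],
)


def query_plan(sql, params=()):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the detail text


search = "SELECT * FROM users WHERE email = ?"

# Without an index, the plan is a full table scan.
plan_before = query_plan(search, ("user42@example.com",))

conn.execute("CREATE INDEX idx_users_email ON users (email)")

# With the index, the plan becomes an index search.
plan_after = query_plan(search, ("user42@example.com",))
```

Printing `plan_before` and `plan_after` shows the switch from a scan to a search that names `idx_users_email`. A slow search that still scans the whole table is a sign the relational database has not been given a fair chance yet.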
SimpleDB is a good technology but its claim to fame is not faster queries than a relational database. Offloading queries to replicated SimpleDB is not likely to significantly improve your query response time.
I am still finding it hard to believe, but our experiments show that a round trip from an EC2 instance to SimpleDB averages out to 300 milliseconds or so, on a good day! On a bad day we've seen it degrade to 1.5 seconds. This is for a single insert. I'd love to see somebody replicate the experiment to verify these results, but as it is, SimpleDB is no solution for anything but post-processing. In the request/response cycle it would just be way too slow.
If the data is largely read-only, try using indexed views. Otherwise, cache the data in the application.