Sharding a RDBMS (SQL) database - sql

I am reading about sharding and I understand it up to a point. But most of the material I read says that sharding (horizontally scaling) an RDBMS is a challenging task. I don't see why NO-SQL would be easy to shard while an RDBMS would be tough to shard.
My understanding is: some NO-SQL databases provide built-in sharding support, which makes them easy to shard. But if a NO-SQL database does not provide built-in sharding support, then the sharding overhead is the same for SQL and NO-SQL, because it has to be implemented in the application layer.
Is my understanding correct or did I miss anything?

I don't think sharding is particularly "harder" in a SQL versus a NO-SQL database from the user perspective. After all, the complicated stuff is all done "under the hood", so the interface for users is pretty similar.
Sharding means that rows of a given table are stored separately -- often in local storage on different nodes. The issue is keeping them up-to-date.
One key difference is that SQL enforces ACID properties on the data, in particular "consistency". This means that a query sees the effects of a transaction either entirely or not at all, never a half-finished state.
NO-SQL databases typically implement eventual consistency. That is, a given transaction may take some time (typically measured in seconds up to a minute) before the transaction completes across all shards.
Consider the situation where a query is deleting one row in each shard. A SQL database will either see all rows deleted or none. A NO-SQL database might return intermediate results.
The advantage of NO-SQL is that large databases are often append-only and transactions only affect one shard -- so eventual consistency is good enough.
The advantage of SQL databases is that consistency is guaranteed (well, in some databases you can fiddle with settings to weaken it). However, there is a higher cost of waiting for all shards to agree that a transaction has completed.
I will note that in some situations SQL databases have a tremendous application advantage -- because the applications do not need to deal with potentially inconsistent data.
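To make the application-layer case from the question concrete, here is a minimal sketch of hash-based shard routing. The shard names, the `ShardedStore` class, and its methods are all hypothetical, and in-memory dicts stand in for the separate nodes; it is an illustration under those assumptions, not how any particular database does it.

```python
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Pick a shard by hashing the key (stable across processes, unlike hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


class ShardedStore:
    """Toy stand-in for one table whose rows live on several nodes."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        self.data = {name: {} for name in self.shards}

    def put(self, key, row):
        # Single-key write: touches exactly one shard.
        self.data[shard_for(key, self.shards)][key] = row

    def get(self, key):
        # Single-key read: also touches exactly one shard.
        return self.data[shard_for(key, self.shards)].get(key)

    def delete_where(self, predicate):
        # Cross-shard operation: shards are updated one after another, so a
        # reader running in between could see a partially applied result
        # unless some coordination protocol makes the change atomic.
        for rows in self.data.values():
            for key in [k for k, v in rows.items() if predicate(v)]:
                del rows[key]
```

Single-key reads and writes stay cheap because they hit one shard. The cross-shard delete is where the SQL and NO-SQL approaches differ: a database with strict ACID guarantees has to coordinate all shards before the change becomes visible, while an eventually consistent one may expose the intermediate state.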

Related

These days SQL databases can store JSON. Then why do we need NoSQL?

One of the advantages of NoSQL databases is to handle unstructured data. Since that issue is now resolved in SQL databases, is there any need left for NoSQL? Only advantage that I can think of is NoSQL is still better at scalability.
You might choose a NoSQL database for the following reasons:
To store large volumes of data that might have little to no structure.
NoSQL databases do not limit the types of data that you can store together. NoSQL databases also enable you to add new data types as your needs change. With document-oriented databases, you can store data in one place without having to define the data type in advance.
To make the most of cloud computing and storage.
In order for a cloud solution to be scalable, the data must be easy to share across multiple servers.
To speed development.
When you are developing in rapid iterations or making frequent updates to the data structure, a relational database slows you down. However, because NoSQL data doesn’t need to be prepped ahead of time, you can make frequent updates to the data structure with minimal downtime.
To boost horizontal scalability.
The CAP (consistency, availability, and partition tolerance) theorem states that in any distributed system, only two of the three CAP properties can be used simultaneously. Adjusting these properties in favor of strong partition tolerance enables NoSQL users to boost horizontal scalability.
The following link gives more detail on the reasons for using a NoSQL database:
https://support.rackspace.com/how-to/reasons-to-use-a-nosql-db/

Why are Relational databases said to be not good at scalability and what gives NOSQL databases the edge here?

Many articles claim that relational databases cannot be scaled and NOSQL is better at it but do not explain why. Scalability is often projected as an advantage of NOSQL. What is the problem with scaling relational databases? What makes NOSQL databases superior to relational databases in the aspect of scalability?
Both SQL and NOSQL databases can scale. However, NOSQL databases have some simplified functionality that can improve scalability.
For instance, SQL databases generally enforce a set of properties called ACID properties. These ensure the consistency of the data over time and the ability to implement an entire transaction "all at once".
However, when running in a multi-processor environment, there is overhead to strictly maintaining the ACID properties. Basically, the data needs to look the same from any processor at the same time.
NOSQL databases often implement "ACID-lite". For instance, they offer "eventual-consistency". This means that for a few seconds or minutes, a query might return different values depending on which processor(s) process it. And, this is fine for many applications.
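The effect of eventual consistency can be shown with a small simulation. This is only a sketch: the `EventuallyConsistentStore` class and its explicit `replicate()` step are hypothetical stand-ins for the background replication a real system would perform.

```python
class EventuallyConsistentStore:
    """Writes land on one replica and are copied to the others later."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = []  # writes not yet applied to the other replicas

    def write(self, key, value):
        # The write is acknowledged as soon as the first replica has it.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read is served by whichever replica the client happens to reach.
        return self.replicas[replica_index].get(key)

    def replicate(self):
        # Stand-in for asynchronous background replication.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("likes", 42)
fresh = store.read("likes", 0)   # the replica that took the write
stale = store.read("likes", 1)   # a replica that has not caught up yet
store.replicate()
caught_up = store.read("likes", 1)
```

Before `replicate()` runs, the two reads disagree; afterwards every replica returns the same value. A strictly ACID system would not acknowledge the write until no reader could observe the disagreement, which is where the extra overhead comes from.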
This truly depends on the long-term requirements of the enterprise and the volume of data expected. The other key factor is whether you mainly need an OLTP-style workload with little reporting, which calls for ACID guarantees. NoSQL is usually the better fit where reporting is vital. Both have their own merits, but a hybrid model that takes advantage of both usually works best: scalability and better transaction control on SQL databases, and high-performance reporting on NoSQL databases, which allow more freedom in the data model (graph databases, key-value stores, and so on). There are a lot of interesting comparisons available, even for the specific databases you want to evaluate.
Puneet

How can NoSQL databases achieve much better write throughput than some relational databases?

How is this possible? What is it about NoSQL that gives it a higher write throughput than some RDBMS? Does it boil down to scalability?
Some NoSQL systems are basically just persistent key/value stores (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will be (or at least should be) faster than an RDBMS, because it only needs a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB). These databases have no predefined data structure. Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
There are many NoSQL solutions around, each one with its own strengths and weaknesses, so the following must be taken with a grain of salt.
But essentially, what many NoSQL databases do is rely on denormalization and try to optimize for the denormalized case. For instance, say you are reading a blog post together with its comments in a document-oriented database. Often, the comments will be saved together with the post itself. This means that it will be faster to retrieve all of them together, as they are stored in the same place and you do not have to perform a join.
Of course, you can do the same in SQL, and denormalizing is a common practice when one needs performance. It is just that many NoSQL solutions are engineered from the start to be always used this way. You then get the usual tradeoffs: for instance, adding a comment in the above example will be slower because you have to save the whole document with it. And once you have denormalized, you have to take care of preserving data integrity in your application.
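The blog example can be sketched with plain Python data structures. The names (`posts`, `comments`, `documents`, and the helper functions) are made up for illustration, and dicts and lists stand in for tables and a document collection.

```python
# Normalized layout: comments live apart from posts and reference them by id.
posts = {1: {"title": "Hello", "body": "First post"}}
comments = [
    {"post_id": 1, "author": "ann", "text": "Nice"},
    {"post_id": 1, "author": "bob", "text": "+1"},
    {"post_id": 2, "author": "cat", "text": "Unrelated"},
]


def load_post_normalized(post_id):
    """Reading a post needs a second lookup over comments (the 'join')."""
    post = dict(posts[post_id])
    post["comments"] = [c for c in comments if c["post_id"] == post_id]
    return post


# Denormalized layout: each document carries its own comments.
documents = {
    1: {
        "title": "Hello",
        "body": "First post",
        "comments": [
            {"author": "ann", "text": "Nice"},
            {"author": "bob", "text": "+1"},
        ],
    }
}


def load_post_document(post_id):
    """Reading a post is a single lookup; no join is needed."""
    return documents[post_id]


def add_comment(post_id, author, text):
    """Adding a comment means rewriting the whole document."""
    doc = dict(documents[post_id])
    doc["comments"] = doc["comments"] + [{"author": author, "text": text}]
    documents[post_id] = doc
```

The read path of the document layout is one lookup, while the write path replaces the entire document, which mirrors the tradeoff described above.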
Moreover, in many NoSQL solutions, it is impossible to do arbitrary joins, hence arbitrary queries. Some databases, like CouchDB, require you to think ahead of the queries you will need and prepare them inside the DB.
All in all, it boils down to expecting a denormalized schema and optimizing reads for that situation, and this works well for data that is not highly relational and that requires many more reads than writes.
This link explains a lot more. In short:
RDBMS -> data integrity is a key feature (which can slow down some operations, such as writes)
NoSQL -> speed and horizontal scalability are imperative (so speed is very high as a result)
Also, the thing about NoSQL is that it cannot be compared to SQL as a single entity. NoSQL is the name for all persistence technologies that are not SQL. Document DBs, key-value DBs, and event DBs are all NoSQL. They differ in almost every aspect, be it the structure of the saved data, querying, performance, or available tools.
Hope this helps.
In summary, NoSQL databases are built to easily scale across a large number of servers (by sharding/horizontal partitioning of data items), and to be fault tolerant (through replication, write-ahead logging, and data repair mechanisms). Furthermore, NoSQL supports achieving high write throughput (by employing memory caches and append-only storage semantics), low read latencies (through caching and smart storage data models), and flexibility (with schema-less design and denormalization).
From:
Open Journal of Databases (OJDB)
Volume 1, Issue 2, 2014
www.ronpub.com/journals/ojdb
ISSN 2199-3459
https://estudogeral.sib.uc.pt/bitstream/10316/27748/1/Which%20NoSQL%20Database.pdf - page 19
A higher write throughput can also be credited to the internal data structures that power the database storage engine.
Even though B-tree implementations used by some RDBMS have stood the test of time, LSM-trees used in some key-value datastores are typically faster for writes:
1: When a write comes in, it is added to an in-memory balanced tree called the memtable.
2: When the memtable grows large, it is flushed to disk.
To understand this data structure better, please check this video and this answer.
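The two steps above can be sketched in a few lines. This is a deliberately simplified toy: the `TinyLSM` class is hypothetical, a dict that is sorted at flush time stands in for the balanced tree, and a list of sorted runs held in memory stands in for the files on disk. Real LSM-tree engines also add a write-ahead log and background compaction, which are left out here.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}            # stand-in for the in-memory balanced tree
        self.memtable_limit = memtable_limit
        self.runs = []                # sorted runs, oldest first (stand-in for disk files)

    def put(self, key, value):
        # Step 1: a write only touches the in-memory memtable.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Step 2: the full memtable is written out once, in sorted order.
        if self.memtable:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then the runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Writes are cheap because they never modify data already flushed; each flush is one sequential write. The cost moves to reads, which may have to look in several runs.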

Join SQL with NoSQL databases

I just wanted to know: does it make any sense to combine a SQL database with a NoSQL database?
Yes, it makes sense. One of the big advantages of NoSQL data storage is that the data is not tied to a specific schema.
One fundamental difference between SQL and NOSQL dB's is support for transactions.
Imagine you were writing a banking app that keeps account balances. You will not be able to achieve accurate balance values unless you use transactions. This is common in all SQL dbs that support ACID semantics.
However, many NOSql databases offer limited or no support for multi-record transactions (some newer ones, such as recent MongoDB versions, do). A store without such support is not suited to a project that needs transactions.
That said, if the same banking app needs tremendous scale, it can be built such that all non transactional data or data that can tolerate "eventual consistency" can use NOSql and data that needs transaction support can be stored in an SQL db.
The advantage of this design is the automatic sharding or splitting of data that NOSql DBs provide, which allows them to scale easily. In effect, the maintenance needs of the DB can be significantly reduced by choosing a hybrid model such as this.
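The banking example can be sketched with Python's built-in `sqlite3` module. The `accounts` table, the account names, and the `transfer` helper are assumptions made for illustration; the point is only that both updates are applied together or not at all.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move money between two accounts atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Raising here undoes the debit above; no money is lost.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
        )


def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)
```

Without the transaction, a failure between the two updates would leave the debit applied and the credit missing, which is the kind of inaccurate balance the answer warns about.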

access sql database as nosql (couchbase)

I would like to access a SQL database the way I would a NoSQL one, as key-value pairs/documents.
This is for a future upgrade: if the number of users increases a lot,
I could migrate from SQL to NoSQL immediately without changing any application code.
Of course I could write the API/solution myself; I just wonder whether anyone has already done this and published the solution.
Your comments are welcome.
While I agree with everything scalabilitysolved has said, there is an interesting feature in the offing for Postgres, scheduled for the 9.4 Postgres release, namely, jsonb: http://www.postgresql.org/docs/devel/static/datatype-json.html with some interesting indexing and query possibilities. I mention this as you tagged Mongodb and Couchbase, both of which use JSON (well, technically BSON in Mongodb's case).
Of course, querying, sharding, replication, ACID guarantees, etc will still be totally different between Postgres (or any other traditional RDBMS) and any document-based NoSQL solution and migrations between any two RDBMS tends to be quite painful, let alone between an RDBMS and a NoSQL data store. However, jsonb looks quite promising as a potential half-way house between two of the major paradigms of data storage.
On the other hand, every release of MongoDB brings enhancements to the aggregation pipeline, which is certainly something that seems to appeal to people used to the flexibility that SQL offers and less "alien" than distributed map/reduce jobs. So, it seems reasonable to conclude that there will continue to be cross pollination.
See Explanation of JSONB introduced by PostgreSQL for some further insights into jsonb.
No no no, do not consider this, it's a really bad idea. Pick either an RDBMS or a NoSQL solution based upon how your data is modelled and your usage patterns. Converting from one to the other is going to be painful, especially if your 'user amount increases a lot'.
Let's face it, either approach would deal with a large increase in usage, and both would benefit more from specific optimizations to the database than from simply swapping because one 'scales more'.
If your data model fits an RDBMS and it needs to perform better, then analyze your queries, check that your indexes are optimized, and look into caching and better data access patterns.
If your data model fits a NoSQL database, then as your dataset grows you can add additional nodes (Couchbase), cache expensive map reduce jobs, and again optimize your data access patterns.
In summary, pick either SQL or NoSQL depending on your data needs. Don't just assume that NoSQL is a magic bullet, as with easier scaling comes a much less flexible querying model.