Vertical AND Horizontal scaling in databases - sql

I am new to databases and am currently comparing RDBMS and Key-Value database systems. I understand that Key-Value database systems like NoSQL are optimized for horizontal scaling and relational database systems are optimized for vertical scaling. Is there a reason why vertical scaling is not efficient in K-V database systems? If not, why aren't K-V database systems used everywhere?

It's not as simple as you think.
There are a number of articles and talks on this controversial issue. NoSQL systems have many benefits, but they also have underlying issues. To mention a few: NoSQL databases are relatively new, so an organization needs to invest in training its engineers before it can make use of them. SQL, on the other hand, has been around long enough to have a mature set of tools for monitoring, analyzing, logging, etc. And the most important problem is that, due to the CAP theorem, you cannot have a database architecture that guarantees Consistency, Availability and Partition tolerance all together. So NoSQL systems give up some features in order to scale, e.g. they might not be strongly consistent but offer eventual consistency instead.
But I recommend that you go further and not rely only on my answer. Technology is changing fast, and tomorrow you may consider my opinion outdated! Although it is not new, I consider this article a good starting point.

Related

Why are Relational databases said to be not good at scalability and what gives NOSQL databases the edge here?

Many articles claim that relational databases cannot be scaled and NOSQL is better at it but do not explain why. Scalability is often projected as an advantage of NOSQL. What is the problem with scaling relational databases? What makes NOSQL databases superior to relational databases in the aspect of scalability?
Both SQL and NOSQL databases can scale. However, NOSQL databases have some simplified functionality that can improve scalability.
For instance, SQL databases generally enforce a set of properties called ACID properties. These ensure the consistency of the data over time and the ability to implement an entire transaction "all at once".
However, when running in a multi-processor environment, there is overhead to strictly maintaining the ACID properties. Basically, the data needs to look the same from any processor at the same time.
NOSQL databases often implement "ACID-lite". For instance, they offer "eventual-consistency". This means that for a few seconds or minutes, a query might return different values depending on which processor(s) process it. And, this is fine for many applications.
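To make "eventual consistency" concrete, here is a minimal toy sketch in Python. It is not how any particular NoSQL product works; the class names (`Replica`, `EventuallyConsistentStore`) and the explicit `sync_all` step are assumptions made for illustration. A write lands on one replica immediately and reaches the others only later, so reads can disagree in the meantime.

```python
class Replica:
    """One copy of the data; replicated writes are applied only on sync()."""

    def __init__(self):
        self.data = {}
        self.pending = []  # replicated writes that have not been applied yet

    def read(self, key):
        return self.data.get(key)

    def sync(self):
        # Apply queued writes in arrival order, then clear the queue.
        for key, value in self.pending:
            self.data[key] = value
        self.pending.clear()


class EventuallyConsistentStore:
    """Hypothetical store: the first replica accepts writes, the rest lag behind."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]

    def write(self, key, value):
        primary, *others = self.replicas
        primary.data[key] = value            # visible on the primary right away
        for replica in others:
            replica.pending.append((key, value))  # delivered later

    def read_from(self, index, key):
        return self.replicas[index].read(key)

    def sync_all(self):
        for replica in self.replicas:
            replica.sync()


store = EventuallyConsistentStore()
store.write("likes", 10)
before = [store.read_from(i, "likes") for i in range(3)]  # replicas disagree
store.sync_all()
after = [store.read_from(i, "likes") for i in range(3)]   # replicas agree
```

Before `sync_all()` only the first replica returns the new value; afterwards all three do. A strictly ACID system would not let a reader observe the intermediate state.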
This truly depends on the long-term requirements of the enterprise and the volume of data expected. The other key factor is whether you need only an OLTP kind of scenario with little reporting, which means implementing ACID. NoSQL is usually the better fit, compared to SQL, for scenarios where reporting is vital. Both have their own merits, but ideally a hybrid model that takes advantage of both usually works better: you get scalability and better transaction control from SQL DBs, and high-performance reporting from a NoSQL DB, which allows every level of freedom, such as graph DBs and key-value pairs. There are a lot of interesting comparisons available, even for the specific DB you want to evaluate.
Puneet

How can NoSQL databases achieve much better write throughput than some relational databases?

How is this possible? What is it about NoSQL that gives it a higher write throughput than some RDBMS? Does it boil down to scalability?
Some NoSQL systems are basically just persistent key/value stores (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will be (or at least should be) faster than an RDBMS, because it only needs a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB). These databases have no predefined data structure. Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
There are many NoSQL solutions around, each one with its own strengths and weaknesses, so the following must be taken with a grain of salt.
But essentially, what many NoSQL databases do is rely on denormalization and try to optimize for the denormalized case. For instance, say you are reading a blog post together with its comments in a document-oriented database. Often, the comments will be saved together with the post itself. This means that it will be faster to retrieve all of them together, as they are stored in the same place and you do not have to perform a join.
Of course, you can do the same in SQL, and denormalizing is a common practice when one needs performance. It is just that many NoSQL solutions are engineered from the start to be always used this way. You then get the usual tradeoffs: for instance, adding a comment in the above example will be slower because you have to save the whole document with it. And once you have denormalized, you have to take care of preserving data integrity in your application.
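The blog example can be sketched side by side. This is only an illustration, assuming an in-memory SQLite database for the relational layout and a plain Python dict (`documents`, a stand-in, not a real document database) for the denormalized layout; the table and key names are made up.

```python
import json
import sqlite3

# Normalized (relational) layout: posts and comments live in separate tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(id),
        body TEXT
    );
    INSERT INTO posts VALUES (1, 'Hello');
    INSERT INTO comments VALUES (1, 1, 'First!'), (2, 1, 'Nice post');
""")

# Reading a post with its comments requires a join.
rows = conn.execute("""
    SELECT p.title, c.body
    FROM posts p JOIN comments c ON c.post_id = p.id
    WHERE p.id = 1
    ORDER BY c.id
""").fetchall()

# Denormalized (document) layout: the comments are stored inside the post.
# A dict of JSON strings stands in for the document store here.
documents = {
    "post:1": json.dumps({"title": "Hello", "comments": ["First!", "Nice post"]})
}

# Reading is a single key lookup, no join needed.
post = json.loads(documents["post:1"])

# The tradeoff: adding one comment rewrites the whole document.
post["comments"].append("Late reply")
documents["post:1"] = json.dumps(post)
```

The read path of the document layout is one lookup, while the write path has to load and re-save the entire post, which is the tradeoff described above.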
Moreover, in many NoSQL solutions, it is impossible to do arbitrary joins, hence arbitrary queries. Some databases, like CouchDB, require you to think ahead of the queries you will need and prepare them inside the DB.
All in all, it boils down to expecting a denormalized schema and optimizing reads for that situation, and this works well for data that is not highly relational and that requires many more reads than writes.
This link explains a lot more. In short:
RDBMS -> data integrity is a key feature (which can slow down some operations like writing)
NoSQL -> speed and horizontal scalability are imperative (so speed is really high as a result)
And... the thing about NoSQL is that it cannot be compared to SQL in any way. NoSQL is the name for all persistence technologies that are not SQL. Document DBs, key-value DBs and event DBs are all NoSQL. They differ in almost all aspects, be it the structure of the saved data, querying, performance or available tools.
Hope this helps.
In summary, NoSQL databases are built to easily scale across a large number of servers (by sharding/horizontal partitioning of data items), and to be fault tolerant (through replication, write-ahead logging, and data repair mechanisms). Furthermore, NoSQL supports achieving high write throughput (by employing memory caches and append-only storage semantics), low read latencies (through caching and smart storage data models), and flexibility (with schema-less design and denormalization).
From:
Open Journal of Databases (OJDB)
Volume 1, Issue 2, 2014
www.ronpub.com/journals/ojdb
ISSN 2199-3459
https://estudogeral.sib.uc.pt/bitstream/10316/27748/1/Which%20NoSQL%20Database.pdf - page 19
A higher write throughput can also be credited to the internal data structures that power the database storage engine.
Even though B-tree implementations used by some RDBMS have stood the test of time, LSM-trees used in some key-value datastores are typically faster for writes:
1: When a write comes, you add it to the in-memory balanced tree, called memtable.
2: When the memtable grows big, it is flushed to the disk.
To understand this data structure better, please check this video and this answer.
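The two steps above can be sketched in a few lines of Python. This is a deliberately simplified toy, not a real storage engine: the class name `TinyLSM` is made up, the memtable is a plain dict that is sorted at flush time rather than a balanced tree, and the flushed segments are kept as sorted lists in memory instead of files on disk.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: writes go to a memtable, which is flushed when full."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}          # recent writes, held in memory
        self.segments = []          # flushed, immutable sorted runs; newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Step 1: a write only touches the in-memory memtable.
        self.memtable[key] = value
        # Step 2: when the memtable grows big, flush it as one sorted run.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: check the memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            # (key,) sorts just before any (key, value) pair with the same key.
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable is full, so it is flushed to a segment
db.put("a", 3)   # newer value for "a" sits in the memtable
```

Writes never modify an existing segment, which is why they are cheap; the cost moves to reads, which may have to look in several places. Real engines also merge (compact) segments in the background, which this sketch leaves out.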

why is sql vertically scalable and nosql horizontally

I am new to NoSQL and trying to understand its meaning.
I have seen many articles in many different websites that repeat the fact that "SQL DataBases are scaled vertically (by adding CPU/memory) whereas NoSQL DataBases are scaled horizontally (by adding more machines that can perform distributed calculations)".
For example these articles:
http://dataconomy.com/sql-vs-nosql-need-know/
http://www.thegeekstuff.com/2014/01/sql-vs-nosql-db/
The thing is that I can't understand why.
As far as I am aware, the major difference between SQL and NoSQL (besides the scalability issue) is that SQL is stored in tables, whereas NoSQL is stored in different ways (Key-Value/Graph/xml, etc..).
I can't seem to understand the connection between those two facts (scalability and storing strategy). These seem like unrelated things to me (probably due to lack of understanding).
The articles are generally reasonable. Both NoSQL technologies and SQL technologies (for lack of a better term) have important roles to play nowadays --- as both articles point out. The discussion is somewhat reminiscent of hierarchical databases versus relational databases, once upon a time.
I disagree with the scalability differences. The discussions leave out technologies such as Hive, PrestoDB, and BigQuery, which are based on highly scalable technologies in the spirit of traditional RDBMSs.
The major differences between RDBMS and NoSQL (in my opinion) are ACID-compliance and data structure. The first is a "burden" that relational databases carry, for both better and worse -- definitely handy for financial transactions, but at the cost of overhead for other purposes. The second is an area where traditional databases are moving towards better handling of unstructured data, with direct support for nested tables, JSON, and XML formats. However, structure is important, as legions of data scientists probably learn the hard way as they interact with data.
Large scalable key-value databases have been designed with "horizontal" scalability in mind. That combined with the lack of pure ACID properties facilitates re-balancing the data for new hardware -- assuming you have designed the database correctly (and that can be a large assumption).
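As a rough illustration of why a key-value model lends itself to spreading data over machines: each record can be routed by its key alone. The sketch below assumes naive hash-modulo routing; the helper names `shard_for` and `partition` and the `user:N` keys are made up for the example, and real systems generally use more careful schemes.

```python
import hashlib


def shard_for(key, n_shards):
    """Route a key to a shard using a stable hash (md5, so runs are repeatable)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards


def partition(keys, n_shards):
    """Group keys by the shard they are routed to."""
    shards = {i: [] for i in range(n_shards)}
    for key in keys:
        shards[shard_for(key, n_shards)].append(key)
    return shards


keys = [f"user:{i}" for i in range(1000)]
before = partition(keys, 4)   # four machines
after = partition(keys, 5)    # one machine added

# Keys whose shard changed must be copied to another machine.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
```

Because no record refers to another, any key can be moved independently of the rest. With naive modulo routing, adding a machine relocates a large share of the keys, which is why schemes such as consistent hashing are commonly used to reduce the amount of data that has to move.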
Databases such as Oracle, DB2, and Teradata have supported parallel processing literally for decades (although more biased toward a single server, albeit with shared-nothing architectures). Their technology pre-dates the more modern Apache-based systems (for lack of a better term), but it doesn't mean that they cannot scale across multiple processors.
New databases such as Hive, Redshift, BigQuery, and PrestoDB provide SQL-based interfaces in the more modern "horizontally" scalable sense (at least for queries). A lot of work is going on in the Postgres world to support parallel processing there -- and the example of databases such as Greenplum, Netezza, Vertica, and so on belie the idea that relational databases do not scale across multiple independent processors.

access sql database as nosql (couchbase)

I would like to access a SQL database the way a NoSQL store is accessed, as key-value pairs/documents.
This is for a future upgrade: if the number of users increases a lot,
I could migrate from SQL to NoSQL immediately without changing any application code.
Of course I can write the API/solution myself; I just wonder whether anyone has already done this and published the solution.
Your comments are welcome.
While I agree with everything scalabilitysolved has said, there is an interesting feature in the offing for Postgres, scheduled for the 9.4 Postgres release, namely, jsonb: http://www.postgresql.org/docs/devel/static/datatype-json.html with some interesting indexing and query possibilities. I mention this as you tagged Mongodb and Couchbase, both of which use JSON (well, technically BSON in Mongodb's case).
Of course, querying, sharding, replication, ACID guarantees, etc. will still be totally different between Postgres (or any other traditional RDBMS) and any document-based NoSQL solution, and migrations between any two RDBMSs tend to be quite painful, let alone between an RDBMS and a NoSQL data store. However, jsonb looks quite promising as a potential half-way house between two of the major paradigms of data storage.
On the other hand, every release of MongoDB brings enhancements to the aggregation pipeline, which is certainly something that seems to appeal to people used to the flexibility that SQL offers and less "alien" than distributed map/reduce jobs. So, it seems reasonable to conclude that there will continue to be cross pollination.
See Explanation of JSONB introduced by PostgreSQL for some further insights into jsonb.
No no no, do not consider this, it's a really bad idea. Pick either a RDBMS or NoSQL solution based upon how your data is modelled and your usage patterns. Converting from one to another is going to be painful and especially if your 'user amount increases a lot'.
Let's face it, either approach would deal with a large increase in usage, and both would benefit more from specific optimizations to their database than from simply swapping because one 'scales more'.
If your data model fits an RDBMS and it needs to perform better, then analyze your queries, check that your indexes are optimized, and look into caching and better data access patterns.
If your data model fits a NoSQL database, then as your dataset grows you can add additional nodes (Couchbase), cache expensive map/reduce jobs and, again, optimize your data access patterns.
In summary, pick either SQL or NoSQL dependent on your data needs, don't just assume that NoSQL is a magic bullet as with easier scaling comes a much less flexible querying model.

Which choice of technology for this?

I face the following problem.
The target is to develop a DB to store the following schema:
You have PRODUCTS that can be composed of both PRIMARY_PRODUCTS and also other PRODUCTS.
My first question is which of SQL or NoSQL technology would be recommended for this.
I don't know NoSQL well, and I am not sure it is worth spending time investigating it if the whole concept is not suited to the problem.
If NoSQL is worth looking at, which kind is recommended? I was looking at Cassandra, but there are so many types that the universe is quite big.
If NoSQL is not suited to this, we need to fall back to SQL.
Do you think that hierarchyId is suited?
Both SQL and NoSQL can store and retrieve data of this kind, and both technologies can be made to do this job.
The major differences are elsewhere: in a nutshell, transactions and guaranteed consistency for SQL versus high performance for readers for NoSQL.
In your precise situation, SQL, with its support for transactions, will ensure that viewers see a composite product only once all of its sub-products have been successfully stored.
In most real-life situations, however, the chance of a viewer seeing a partially-committed product on a NoSQL system is so slim as to be irrelevant: future reads of the product will be correct.
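For the SQL route, one possible layout is sketched below, using an in-memory SQLite database from Python. Everything here is an assumption for illustration: the table names (`products`, `components`), the sample data, and the helper `primary_parts` are made up, and the sketch uses a plain parent/child table with a recursive query rather than SQL Server's hierarchyId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        is_primary INTEGER NOT NULL   -- 1 for PRIMARY_PRODUCTS, 0 for composites
    );
    CREATE TABLE components (
        parent_id INTEGER NOT NULL REFERENCES products(id),
        child_id  INTEGER NOT NULL REFERENCES products(id),
        PRIMARY KEY (parent_id, child_id)
    );
""")

# `with conn` runs the inserts as one transaction: commit on success,
# rollback on error, so readers never see a half-stored composite product.
with conn:
    conn.executemany("INSERT INTO products VALUES (?, ?, ?)", [
        (1, "bolt", 1), (2, "plate", 1), (3, "bracket", 0), (4, "shelf", 0),
    ])
    conn.executemany("INSERT INTO components VALUES (?, ?)", [
        (3, 1), (3, 2),   # bracket = bolt + plate
        (4, 3), (4, 2),   # shelf = bracket + plate
    ])


def primary_parts(conn, product_id):
    """Return the names of all primary products a product is ultimately made of."""
    rows = conn.execute("""
        WITH RECURSIVE tree(id) AS (
            SELECT child_id FROM components WHERE parent_id = ?
            UNION
            SELECT c.child_id FROM components c JOIN tree t ON c.parent_id = t.id
        )
        SELECT p.name FROM tree JOIN products p ON p.id = tree.id
        WHERE p.is_primary = 1
        ORDER BY p.name
    """, (product_id,)).fetchall()
    return [name for (name,) in rows]


shelf_parts = primary_parts(conn, 4)
```

The recursive query walks the composition to any depth, and the transaction around the inserts is what gives the "all sub-products or nothing" visibility mentioned above.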