What are the best practices for designing a scalable database? If the database or its tables span multiple servers, how can I join them?
Where can I get more information for this?
Here is a white paper on how to scale out MySQL:
Scale out mysql
That said, IMO, if you are looking to scale a database across multiple servers, you should take a look at (and consider using) NoSQL databases. Here is a link where you can start your research:
nosql-database org
Related
One of the advantages of NoSQL databases is that they handle unstructured data. Since SQL databases can now handle unstructured data too, is there any need left for NoSQL? The only advantage I can think of is that NoSQL is still better at scalability.
You might choose a NoSQL database for the following reasons:
To store large volumes of data that might have little to no
structure.
NoSQL databases do not limit the types of data that you can store
together. NoSQL databases also enable you to add new data types as
your needs change. With document-oriented databases, you can store
data in one place without having to define the data type in advance.
To make the most of cloud computing and storage.
In order for a cloud solution to be scalable, the data must be easy
to share across multiple servers.
To speed development.
When you are developing in rapid iterations or making frequent
updates to the data structure, a relational database slows you down.
However, because NoSQL data doesn’t need to be prepped ahead of time,
you can make frequent updates to the data structure with minimal
downtime.
To boost horizontal scalability.
The CAP (consistency, availability, and partition tolerance) theorem
states that a distributed system can guarantee at most two of the
three CAP properties at the same time. Many NoSQL databases favor
partition tolerance and relax strict consistency, which makes it
easier to spread data across servers and boost horizontal scalability.
The following link gives more detail on when a NoSQL database is a good fit:
https://support.rackspace.com/how-to/reasons-to-use-a-nosql-db/
Many articles claim that relational databases cannot be scaled and NOSQL is better at it but do not explain why. Scalability is often projected as an advantage of NOSQL. What is the problem with scaling relational databases? What makes NOSQL databases superior to relational databases in the aspect of scalability?
Both SQL and NOSQL databases can scale. However, NOSQL databases have some simplified functionality that can improve scalability.
For instance, SQL databases generally enforce a set of properties called ACID properties. These ensure the consistency of the data over time and the ability to implement an entire transaction "all at once".
However, when running in a multi-processor environment, there is overhead to strictly maintaining the ACID properties. Basically, the data needs to look the same from any processor at the same time.
NOSQL databases often implement "ACID-lite". For instance, they offer "eventual-consistency". This means that for a few seconds or minutes, a query might return different values depending on which processor(s) process it. And, this is fine for many applications.
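To make "eventual consistency" concrete, here is a minimal sketch in Python. It is a toy model, not how any particular NoSQL database works: the class names, the single-replica acknowledgement, and the explicit sync() call are all assumptions made for illustration. A write lands on one replica first, so a read from another replica can return a stale value until propagation has happened.

```python
class Replica:
    """One copy of the data (hypothetical toy replica)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes reach replica 0 first, the others only on sync()."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        # Writes not yet propagated, as (replica_index, key, value) tuples.
        self.pending = []

    def write(self, key, value):
        # Assumption: the write is acknowledged once replica 0 has it.
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read is served by whichever replica the caller happens to hit.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Stand-in for background propagation, applied in write order.
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("balance", 100)
stale = store.read("balance", replica_index=2)  # replica 2 has not seen the write
store.sync()
fresh = store.read("balance", replica_index=2)  # now all replicas agree
```

A strictly ACID system would not acknowledge the write until every reader was guaranteed to see it, and that coordination is the overhead described above.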
This truly depends on the long-term requirements of the enterprise and the volume of data expected. The other key factor is whether you only need an OLTP-style workload with little reporting, which means implementing ACID. NoSQL is usually the better fit where reporting is vital. Both have their own merits, but ideally a hybrid model that takes advantage of both works better: you get scalability and better transaction control from the SQL DBs, and high-performance reporting from a NoSQL DB, which gives you the freedom to pick a model such as a graph DB or a key-value store. There are a lot of interesting comparisons available, even for the specific DBs you want to evaluate.
Puneet
There's been a lot of hype about NoSQL databases being used by big sites like Twitter and Facebook. However, as I've looked into this more carefully, it seems like most of the successful companies in this space have been using a combination of database technologies, using MySQL as the main database and then adding NoSQL databases for things like adding a caching layer to improve performance. I've also heard that Diaspora originally started out using MongoDB as their primary database, and then had to switch to a relational database because Mongo turned out to be ill-suited to their needs. In particular, I've heard that representing relationships between users really calls for a relational database or maybe a graph database.
However, Spotify seems to be really big on Cassandra, which is neither a relational database nor a graph database. Furthermore, while Spotify isn't known for social networking, it does include features like being able to follow other users and see what songs they've been listening to. If this is all done with Cassandra, maybe Cassandra is well-suited for social networking, even representing relationships between users? Can anyone give me any insight into this?
EDIT: I know Cassandra doesn't support joins, but is there a reasonable way to represent a social graph with Cassandra in spite of lack of joins? Also, I'm especially interested in Cassandra vs. SQL for social graph, less interested in Cassandra vs. graph DB.
Cassandra is very good for high speed writes and reads using simple key-values, or bigtable-esque slices within a partition.
Cassandra is very bad at anything that you would model as a SQL JOIN, or searching for arbitrary text.
The reason people tend to use a combination of technologies is that different tech is designed for different problems - a tool optimized for searching (elasticsearch, solr, etc) is going to be much better at search-type problems, but won't have the read/write throughput for key/value lookups that you'll get from Cassandra.
They all have different use cases and a single database may not suffice.
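On the "no joins" point: the usual workaround in a partition-keyed store is to denormalize, i.e. write each relationship once per query you need to answer. Below is a rough Python sketch of that idea using plain dictionaries as a stand-in for partition-keyed tables. It does not use a Cassandra driver, and the table and function names are made up for illustration.

```python
from collections import defaultdict

# Two hypothetical "tables", each keyed by the partition key it is read by.
following_by_user = defaultdict(set)  # user -> users they follow
followers_by_user = defaultdict(set)  # user -> users who follow them


def follow(follower, followee):
    # Store the same fact twice so each read is a single partition lookup.
    following_by_user[follower].add(followee)
    followers_by_user[followee].add(follower)


def unfollow(follower, followee):
    # Both copies must be updated, which is the cost of denormalizing.
    following_by_user[follower].discard(followee)
    followers_by_user[followee].discard(follower)


follow("alice", "bob")
follow("carol", "bob")
follow("alice", "carol")

bobs_followers = sorted(followers_by_user["bob"])
alice_follows = sorted(following_by_user["alice"])
```

Single-hop questions ("who follows bob?") stay cheap this way. Multi-hop questions ("friends of friends") need one lookup per user at every hop, which is where this model starts to hurt.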
For a social networking site, a combination of these may be used. A SQL or NoSQL db may be used for storing user information, preferences, and the like, depending upon what scale you're looking at.
Relationship requirements (social network) are however different, and both SQL and NoSQL databases (including Cassandra) would be a bad choice to represent these.
A graph database tends to be an order of magnitude faster and efficient in representing a social graph and executing related algorithms.
One of our customers is in the manufacturing domain. He has multiple factories across the country. For quality control, he is using a Windows application deployed independently in each factory (approx. 100 in total). Our customer is interested in replacing all the Windows applications with a single web application. Now the problem is that the volume of data will be 100 times bigger, and so will the velocity (if we keep a single database for all the factories). There are lots of reporting use cases in this domain. Looking at the numbers, it looks like SQL will not be able to handle this much load.
Is it a valid use case to move to NoSQL database?
Can Volume/Velocity alone be a deciding factor to move to NoSQL?
Would we be able to get all those reports from a NoSQL database efficiently?
Any kind of help will be appreciated.
Thanks In Advance
This is a useful discussion.
In my opinion a well-designed MS SQL Server 2012 (or Oracle server, though I have no experience with it) should be capable of handling 1000 complex transactions per second.
MS SQL Server 2014 with in-memory processing raises expectations even higher.
Consider multiple processors, large memory, table partitioning, file mapping, and multiple access paths to the SAN or to separate server disks. Use well-designed transactions (consider removing most indexes on transaction tables).
As an extra benefit you keep all the functionality of the SQL server. In my opinion most NoSQL solutions are NoSQL because they are deprived of essential SQL functionality.
Switching to a NoSQL database is most useful when you require functionality outside the transaction domain, e.g. document indexing or network indexing.
I am asking this in the context of NoSQL - which achieves scalability and performance without being expensive.
So, if I needed to achieve massively parallel distributed computing across databases ...
What are the various methodologies available today (within the RDBMS paradigm) to achieve distributed computing with high-scalability?
Does database clustering & mirroring contribute in any way towards distributed computing?
I guess you are asking about the scalability of RDBMS databases. NoSQL databases based on Amazon Dynamo or BigTable (I am talking about HBase, Cassandra, etc.) are a whole other topic. There are also commercial products like Oracle Coherence, which is more like a distributed cache and key-value store, to put it crudely.
Going back to RDBMS:
Sharding
To scale an RDBMS you can do custom sharding. Sharding is a technique where you split the data across multiple tables, possibly on multiple hosts, and then assign each row to a table according to some rule. For example, you can say that rows 1-1M go to table1, rows 1M-2M go to table2, etc. This is a difficult process from an administration point of view, but a lot of large-scale websites scale by relying on sharding. Other techniques worth mentioning are partitioning, MySQL federation, and MySQL Cluster.
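The row-range rule can be sketched as a small routing function. This is only an illustration in Python: the shard names, the host names, and the choice to treat each upper bound as inclusive (so row 1,000,000 belongs to table1) are assumptions, not something a particular database prescribes.

```python
import bisect

# Hypothetical layout: each shard owns a contiguous range of row ids.
# Upper bounds are inclusive: ids 1..1_000_000 -> first shard, and so on.
SHARD_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]
SHARD_NAMES = ["table1@host-a", "table2@host-b", "table3@host-c"]


def shard_for(row_id):
    """Return the shard that owns row_id, or raise if no shard covers it."""
    if row_id < 1:
        raise ValueError(f"row id must be positive, got {row_id}")
    # Index of the first upper bound that is >= row_id.
    index = bisect.bisect_left(SHARD_UPPER_BOUNDS, row_id)
    if index == len(SHARD_UPPER_BOUNDS):
        raise ValueError(f"no shard configured for row id {row_id}")
    return SHARD_NAMES[index]


first = shard_for(1)
boundary = shard_for(1_000_000)
second = shard_for(1_000_001)
```

The application has to call something like shard_for() before every query, and a query that touches rows on two shards has to be split and merged by the application itself. That is a large part of the administrative pain mentioned above.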
MPP databases
Then there are RDBMS-style databases that do the distribution and scaling for you. Teradata is the most successful of these companies. I believe they used Postgres core code at some point. A significant number of Fortune 500 companies and a lot of the airlines use Teradata, but it's ridiculously expensive. There are newer companies like Greenplum, Vertica, and Netezza.
Unless you're a very big company with extreme scalability requirements, you can scale your DB horizontally while keeping ACID guarantees by building a cluster of identical RDBMS instances and keeping them synchronized with JTA transactions.
Take a look at this Java/JDBC-based article; it uses the JEPLayer framework, but you can use straight JDBC and JTA code instead.
Within the RDBMS paradigm: Sharding.
Outside the RDBMS paradigm: Key-value stores.
My pick (I come from an RDBMS background): key-value stores of the tabular type, e.g. HBase.
Within the RDBMS paradigm, sharding will not get you far.
Use the RDBMS paradigm to design your model, to get your project up and running.
Use tabular key-value stores to SCALE OUT.
Sharding:
A good way to think about sharding is to see it as user-account-oriented
DB design.
All the schema entities touched by a user account are kept on one host.
The assignment of user to host happens when the user creates an account.
The least loaded host gets that user.
When that user signs on after account creation, he gets connected
to the host that has his data.
Each host has a set of user accounts.
The problem with this approach is that if the host gets hosed,
a fraction of users will be blacked out.
The solution to this is to have a replicated standby host that
becomes the primary when the primary host encounters problems.
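The account-to-host assignment and the standby promotion described above can be sketched roughly like this in Python. Everything here (the ShardDirectory class, the host names, breaking ties by host name) is hypothetical and only meant to show the bookkeeping. Replication itself is assumed to happen elsewhere.

```python
class ShardDirectory:
    """Toy directory mapping user accounts to the host that holds their data."""

    def __init__(self, hosts):
        # hosts: mapping of primary host name -> standby host name.
        self.standby_of = dict(hosts)
        self.users_on = {primary: set() for primary in hosts}
        self.host_of_user = {}

    def create_account(self, user):
        # The least loaded host gets the new user (ties broken by host name).
        primary = min(self.users_on, key=lambda h: (len(self.users_on[h]), h))
        self.users_on[primary].add(user)
        self.host_of_user[user] = primary
        return primary

    def host_for(self, user):
        # On sign-in, the user is connected to the host that has their data.
        return self.host_of_user[user]

    def fail_over(self, failed_primary):
        # Promote the standby; it is assumed to already hold a replica.
        standby = self.standby_of.pop(failed_primary)
        users = self.users_on.pop(failed_primary)
        self.users_on[standby] = users
        for user in users:
            self.host_of_user[user] = standby
        # The promoted host has no standby until a new one is provisioned.
        self.standby_of[standby] = None
        return standby


directory = ShardDirectory({"db1": "db1-standby", "db2": "db2-standby"})
alice_host = directory.create_account("alice")
bob_host = directory.create_account("bob")
carol_host = directory.create_account("carol")
promoted = directory.fail_over("db1")
```

Without the standby, a failure of db1 would black out alice and carol while bob stays online, which is the "fraction of users" problem described above.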
Also, it's a fairly rigid setup for processes where the design does
not change dramatically.
From the user standpoint, I've noticed that web sites
with a sharded DB backend are not as quick to "turn on a dime"
to create different business models on their platform.
Contrast this with web sites that have truly distributed
key-value stores. These businesses can host any range of
services. Their platform is just that - a platform.
It's not relational and it does have an API interface,
but it just seems to work.