Hypertable v. HBase and BigTable v. SQL - sql

As Hypertable and HBase seem to be the two major open source BigTable implementations, what are the major pros and cons between these two databases?
In addition, what are the major pros and cons between BigTable and SQL RDBMSes, and what significant differences can I expect between writing a project with a traditional RDBMS like Postgres and Hypertable?

At the risk of broadening your second question more than I should (I've never played with BigTable, but I've toyed with MongoDB and CouchDB)...
The most important difference, in so far as I've understood it anyway, is that RDBMS all use a row-based store, whereas NoSQL engines use a column-based store. The pros and cons mostly derives from this point.
http://en.wikipedia.org/wiki/Column-oriented_DBMS
The major consideration that I tend to keep in mind is ACID compliance: a NoSQL engine is eventually consistent, rather than always consistent. Think of it like a storage that behaves like a website's cache: the latter is normally valid and consistent, but occasionally slightly outdated/inconsistent.
There's no right or wrong here: for some use-cases (e.g. a search engine, a blog), slightly inconsistent is a very acceptable option; for others (e.g. a bank, a billing system) it is not. (I tend to work on stuff that needs atomicity.)
Then, there are plenty of performance considerations that break down to implementation details.
An immediate consequence of striving for eventual consistency is that integrity checks and so forth are typically done in the app rather than the data store (i.e. there are no triggers or stored procedures to speak of). Your data store ends up with less work to do, which results in obvious performance benefits.
A column-based store means that if you update a single column from your document, you only invalidate that column. A row-based store, by contrast, invalidates the entire row. Depending on how you typically update your data (i.e. just a few columns vs most of them), either approach can add up.
A flip side of a column-based store is that it makes joins trickier (from an implementation standpoint). In overly simplistic terms, think of it as having an EAV table per column; this works fine for a few tables. It's a different story if you need a big report that requires a dozen joins on sales or stocks (which a good RDBMS will handle just fine).
A more experienced user will hopefully chime in on NoSQL sharding and replication. On this I'd only feel comfortable enough to point out that Postgres has built-in replication features since 9.0 and is quite good at dealing with queries that span multiple partitions.
Anyway... To cut a very long story short: unless you already know that you'll need to instantly scale to petabytes and gazillions of requests in multitudes of data centers in your next project, I think the only consideration that you should have in mind when picking an SQL or NoSQL implementation is whether you absolutely need ACID compliance or not.
Lastly, if your main interest lies in trying a new toy, consider trying a graph-oriented database instead. These potentially combine the benefits of row- and column-based stores.

Related

What is NoSQL? How it works? [duplicate]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I've been hearing things about NoSQL and that it may eventually become the replacement for SQL DB storage methods due to the fact that DB interaction is often a bottle neck for speed on the web.
So I just have a few questions:
What exactly is it?
How does it work?
Why would it be better than using a SQL Database? And how much better is it?
Is the technology too new to start implementing yet or is it worth taking a look into?
There is no such thing as NoSQL!
NoSQL is a buzzword.
For decades, when people were talking about databases, they meant relational databases. And when people were talking about relational databases, they meant those you control with Edgar F. Codd's Structured Query Language. Storing data in some other way? Madness! Anything else is just flatfiles.
But in the past few years, people started to question this dogma. People wondered if tables with rows and columns are really the only way to represent data. People started thinking and coding, and came up with many new concepts how data could be organized. And they started to create new database systems designed for these new ways of working with data.
The philosophies of all these databases were different. But one thing all these databases had in common, was that the Structured Query Language was no longer a good fit for using them. So each database replaced SQL with their own query languages. And so the term NoSQL was born, as a label for all database technologies which defy the classic relational database model.
So what do NoSQL databases have in common?
Actually, not much.
You often hear phrases like:
NoSQL is scalable!
NoSQL is for BigData!
NoSQL violates ACID!
NoSQL is a glorified key/value store!
Is that true? Well, some of these statements might be true for some databases commonly called NoSQL, but every single one is also false for at least one other. Actually, the only thing NoSQL databases have in common, is that they are databases which do not use SQL. That's it. The only thing that defines them is what sets them apart from each other.
So what sets NoSQL databases apart?
So we made clear that all those databases commonly referred to as NoSQL are too different to evaluate them together. Each of them needs to be evaluated separately to decide if they are a good fit to solve a specific problem. But where do we begin? Thankfully, NoSQL databases can be grouped into certain categories, which are suitable for different use-cases:
Document-oriented
Examples: MongoDB, CouchDB
Strengths: Heterogenous data, working object-oriented, agile development
Their advantage is that they do not require a consistent data structure. They are useful when your requirements and thus your database layout changes constantly, or when you are dealing with datasets which belong together but still look very differently. When you have a lot of tables with two columns called "key" and "value", then these might be worth looking into.
Graph databases
Examples: Neo4j, GiraffeDB.
Strengths: Data Mining
While most NoSQL databases abandon the concept of managing data relations, these databases embrace it even more than those so-called relational databases.
Their focus is at defining data by its relation to other data. When you have a lot of tables with primary keys which are the primary keys of two other tables (and maybe some data describing the relation between them), then these might be something for you.
Key-Value Stores
Examples: Redis, Cassandra, MemcacheDB
Strengths: Fast lookup of values by known keys
They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
Unfortunately they assume that you know exactly what you are looking for. You need the profile of User157641? No problem, will only take microseconds. But what when you want the names of all users who are aged between 16 and 24, have "waffles" as their favorite food and logged in in the last 24 hours? Tough luck. When you don't have a definite and unique key for a specific result, you can't get it out of your K-V store that easily.
Is SQL obsolete?
Some NoSQL proponents claim that their favorite NoSQL database is the new way of doing things, and SQL is a thing of the past.
Are they right?
No, of course they aren't. While there are problems SQL isn't suitable for, it still got its strengths. Lots of data models are simply best represented as a collection of tables which reference each other. Especially because most database programmers were trained for decades to think of data in a relational way, and trying to press this mindset onto a new technology which wasn't made for it rarely ends well.
NoSQL databases aren't a replacement for SQL - they are an alternative.
Most software ecosystems around the different NoSQL databases aren't as mature yet. While there are advances, you still haven't got supplemental tools which are as mature and powerful as those available for popular SQL databases.
Also, there is much more know-how for SQL around. Generations of computer scientists have spent decades of their careers into research focusing on relational databases, and it shows: The literature written about SQL databases and relational data modelling, both practical and theoretical, could fill multiple libraries full of books. How to build a relational database for your data is a topic so well-researched it's hard to find a corner case where there isn't a generally accepted by-the-book best practice.
Most NoSQL databases, on the other hand, are still in their infancy. We are still figuring out the best way to use them.
What exactly is it?
On one hand, a specific system, but it has also become a generic word for a variety of new data storage backends that do not follow the relational DB model.
How does it work?
Each of the systems labelled with the generic name works differently, but the basic idea is to offer better scalability and performance by using DB models that don't support all the functionality of a generic RDBMS, but still enough functionality to be useful. In a way it's like MySQL, which at one time lacked support for transactions but, exactly because of that, managed to outperform other DB systems. If you could write your app in a way that didn't require transactions, it was great.
Why would it be better than using a SQL Database? And how much better is it?
It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford and optimized as much as possible simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in extreme cases.
Is the technology too new to start implementing yet or is it worth taking a look into?
Depends mainly on what you're trying to achieve. It's certainly mature enough to use. But few applications really need to scale that massively. For most, a traditional RDBMS is sufficient. However, with internet usage becoming more ubiquitous all the time, it's quite likely that applications that do will become more common (though probably not dominant).
Since someone said that my previous post was off-topic, I'll try to compensate :-) NoSQL is not, and never was, intended to be a replacement for more mainstream SQL databases, but a couple of words are in order to get things in the right perspective.
At the very heart of the NoSQL philosophy lies the consideration that, possibly for commercial and portability reasons, SQL engines tend to disregard the tremendous power of the UNIX operating system and its derivatives.
With a filesystem-based database, you can take immediate advantage of the ever-increasing capabilities and power of the underlying operating system, which have been steadily increasing for many years now in accordance with Moore's law. With this approach, many operating-system commands become automatically also "database operators" (think of "ls" "sort", "find" and the other countless UNIX shell utilities).
With this in mind, and a bit of creativity, you can indeed devise a filesystem-based database that is able to overcome the limitations of many common SQL engines, at least for specific usage patterns, which is the whole point behind NoSQL's philosophy, the way I see it.
I run hundreds of web sites and they all use NoSQL to a greater or lesser extent. In fact, they do not host huge amounts of data, but even if some of them did I could probably think of a creative use of NoSQL and the filesystem to overcome any bottlenecks. Something that would likely be more difficult with traditional SQL "jails". I urge you to google for "unix", "manis" and "shaffer" to understand what I mean.
If I recall correctly, it refers to types of databases that don't necessarily follow the relational form. Document databases come to mind, databases without a specific structure, and which don't use SQL as a specific query language.
It's generally better suited to web applications that rely on performance of the database, and don't need more advanced features of Relation Database Engines. For example, a Key->Value store providing a simple query by id interface might be 10-100x faster than the corresponding SQL server implementation, with a lower developer maintenance cost.
One example is this paper for an OLTP Tuple Store, which sacrificed transactions for single threaded processing (no concurrency problem because no concurrency allowed), and kept all data in memory; achieving 10-100x better performance as compared to a similar RDBMS driven system. Basically, it's moving away from the 'One Size Fits All' view of SQL and database systems.
In practice, NoSQL is a database system which supports fast access to large binary objects (docs, jpgs etc) using a key based access strategy. This is a departure from the traditional SQL access which is only good enough for alphanumeric values. Not only the internal storage and access strategy but also the syntax and limitations on the display format restricts the traditional SQL. BLOB implementations of traditional relational databases too suffer from these restrictions.
Behind the scene it is an indirect admission of the failure of the SQL model to support any form of OLTP or support for new dataformats. "Support" means not just store but full access capabilities - programmatic and querywise using the standard model.
Relational enthusiasts were quick to modify the defnition of NoSQL from Not-SQL to Not-Only-SQL to keep SQL still in the picture! This is not good especially when we see that most Java programs today resort to ORM mapping of the underlying relational model. A new concept must have a clearcut definition. Else it will end up like SOA.
The basis of the NoSQL systems lies in the random key - value pair. But this is not new. Traditional database systems like IMS and IDMS did support hashed ramdom keys (without making use of any index) and they still do. In fact IDMS already has a keyword NONSQL where they support SQL access to their older network database which they termed as NONSQL.
It's like Jacuzzi: both a brand and a generic name. It's not just a specific technology, but rather a specific type of technology, in this case referring to large-scale (often sparse) "databases" like Google's BigTable or CouchDB.
NoSQL the actual program appears to be a relational database implemented in awk using flat files on the backend. Though they profess, "NoSQL essentially has no arbitrary limits, and can work where other products can't. For example there is no limit on data field size, the number of columns, or file size" , I don't think it is the large scale database of the future.
As Joel says, massively scalable databases like BigTable or HBase, are much more interesting. GQL is the query language associated with BigTable and App Engine. It's largely SQL tweaked to avoid features Google considers bottle-necks (like joins). However, I haven't heard this referred to as "NoSQL" before.
NoSQL is a database system which doesn't use string based SQL queries to fetch data.
Instead you build queries using an API they will provide, for example Amazon DynamoDB is a good example of a NoSQL database.
NoSQL databases are better for large applications where scalability is important.
Does NoSQL mean non-relational database?
Yes, NoSQL is different from RDBMS and OLAP. It uses looser consistency models than traditional relational databases.
Consistency models are used in distributed systems like distributed shared memory systems or distributed data store.
How it works internally?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key-value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.
It can work on Structured and Unstructured Data. It uses Collections instead of Tables
How do you query such "database"?
Watch SQL vs NoSQL: Battle of the Backends; it explains it all.

SQL versus noSQL (speed)

When people are comparing SQL and noSQL, and concluding the upsides and downsides of each one, what I never hear anyone talking about is the speed.
Isn't performing SQL queries generally faster than performing noSQL queries?
I mean, for me this would be a really obvious conclusion, because you should always be able to find something faster if you know the structure of your database than if you don't.
But people never seem to mention this, so I want to know if my conclusion is right or wrong.
People who tend to use noSQL use it specifically because it fits their use cases. Being divorced from normal RDBMS table relationships and constraints, as well as ACID-ity of data, it's very easy to make it run a lot faster.
Consider Twitter, which uses NoSQL because a user only does very limited things on site, or one exactly - tweet. And concurrency can be considered non-existent since (1) nobody else can modify your tweet and (2) you won't normally be simultaneously tweeting from multiple devices.
The definition of noSQL systems is a very broad one -- a database that doesn't use SQL / is not a RDBMS.
Therefore, the answer to your question is, in short: "it depends".
Some noSQL systems are basically just persistent key/value storages (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will (or at least should be) faster that an RDBMS, because it only needs to have a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB).
These databases have no predefined data structure.
Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
As Einstein would say, speed is relative.
If you need to store a master/detail simple application (like a shopping cart), you would need to do several Insert statements in your SQL application, also you will get a Data set of information when you do a query to get the purchase, if you're using NoSQL, and you're using it well, then you would have all the data for a single order in one simple "record" (document if you use the terms of NoSQL databases like djondb).
So, I really think that the performance of an application can be measured by the number of things it need to do to achieve a single requirement, if you need to do several Inserts to store an order and you only need one simple Insert in a database like djondb then the performance will be 10x faster in the NoSQL world, just because you're using 10 times less calls to the database layer, that's it.
To illustrate my point let me link an example I wrote sometime ago about the differences between NoSQL and SQL data models approach: https://web.archive.org/web/20160510045647/http://djondb.com/blog/nosql-masterdetail-sample/, I know it's a self reference, but basically I wrote it to address this question which I found it's the most challenging question a RDBMS guy could have and it's always a good way to explain why NoSQL is so different from SQL world, and why it will achieve better performance anytime, not because we use "nasa" technology, it's because NoSQL will let the developer do less... and get more, and less code = greater performance.
The answer is: it depends. Generally speaking, the objective of NoSQL DATABASES (no "queries") is scalability. RDBMS usually have some hard limits at some point (I'm talking about millons and millons of rows) where you could not scale any more by traditional means (Replication, clustering, partitioning), and you need something more because your needs keep growing. Or even if you manage to scale, the overall setup is quite complicated. Or you can scale reads, but not writes.
And the queries depends on the particular implementation of your server, the type of query you are doing, the columns in the table, etc... remember that queries are just one part of the RDBMS.
query time of relational database like SQL for 1000 person data is 2000 ms and graph database like neo4j
is 2ms .if you crate more node 1000000 speed stable 2 ms

What are the [dis]advantages of using a key/value table over nullable columns or separate tables? [duplicate]

This question already has answers here:
How to design a product table for many kinds of product where each product has many parameters
(4 answers)
Closed 1 year ago.
I'm upgrading a payment management system I created a while ago. It currently has one table for each payment type it can accept. It is limited to only being able to pay for one thing, which this upgrade is to alleviate. I've been asking for suggestions as to how I should design it, and I have these basic ideas to work from:
Have one table for each payment type, with a few common columns on each. (current design)
Coordinate all payments with a central table that takes on the common columns (unifying payment IDs regardless of type), and identifies another table and row ID that has columns specialized to that payment type.
Have one table for all payment types, and null the columns which are not used for any given type.
Use the central table idea, but store specialized columns in a key/value table.
My goals for this are: not ridiculously slow, self-documenting as much as possible, and maximizing flexibility while maintaining the other goals.
I don't like 1 very much because of the duplicate columns in each table. It reflects the payment type classes inheriting a base class that provides functionality for all payment types... ORM in reverse?
I'm leaning toward 2 the most, because it's just as "type safe" and self-documenting as the current design. But, as with 1, to add a new payment type, I need to add a new table.
I don't like 3 because of its "wasted space", and it's not immediately clear which columns are used for which payment types. Documentation can alleviate the pain of this somewhat, but my company's internal tools do not have an effective method for storing/finding technical documentation.
The argument I was given for 4 was that it would alleviate needing to change the database when adding a new payment method, but it suffers even worse than 3 does from the lack of explicitness. Currently, changing the database isn't a problem, but it could become a logistical nightmare if we decide to start letting customers keep their own database down the road.
So, of course I have my biases. Does anyone have any better ideas? Which design do you think fits best? What criteria should I base my decision on?
Note
This subject is being discussed, and this thread is being referenced in other threads, therefore I have given it a reasonable treatment, please bear with me. My intention is to provide understanding, so that you can make informed decisions, rather than simplistic ones based merely on labels. If you find it intense, read it in chunks, at your leisure; come back when you are hungry, and not before.
What, exactly, about EAV, is "Bad" ?
1 Introduction
There is a difference between EAV (Entity-Attribute-Value Model) done properly, and done badly, just as there is a difference between 3NF done properly and done badly. In our technical work, we need to be precise about exactly what works, and what does not; about what performs well, and what doesn't. Blanket statements are dangerous, misinform people, and thus hinder progress and universal understanding of the issues concerned.
I am not for or against anything, except poor implementations by unskilled workers, and misrepresenting the level of compliance to standards. And where I see misunderstanding, as here, I will attempt to address it.
Normalisation is also often misunderstood, so a word on that. Wikipedia and other free sources actually post completely nonsensical "definitions", that have no academic basis, that have vendor biases so as to validate their non-standard-compliant products. There is a Codd published his Twelve Rules. I implement a minimum of 5NF, which is more than enough for most requirements, so I will use that as a baseline. Simply put, assuming Third Normal Form is understood by the reader (at least that definition is not confused) ...
2 Fifth Normal Form
2.1 Definition
Fifth Normal Form is defined as:
every column has a 1::1 relation with the Primary Key, only
and to no other column, in the table, or in any other table
the result is no duplicated columns, anywhere; No Update Anomalies (no need for triggers or complex code to ensure that, when a column is updated, its duplicates are updated correctly).
it improves performance because (a) it affects less rows and (b) improves concurrency due to reduced locking
I make the distinction that, it is not that a database is Normalised to a particular NF or not; the database is simply Normalised. It is that each table is Normalised to a particular NF: some tables may only require 1NF, others 3NF, and yet others require 5NF.
2.2 Performance
There was a time when people thought that Normalisation did not provide performance, and they had to "denormalise for performance". Thank God that myth has been debunked, and most IT professionals today realise that Normalised databases perform better. The database vendors optimise for Normalised databases, not for denormalised file systems. The truth "denormalised" is, the database was NOT normalised in the first place (and it performed badly), it was unnormalised, and they did some further scrambling to improve performance. In order to be Denormalised, it has to be faithfully Normalised first, and that never took place. I have rewritten scores of such "denormalised for performance" databases, providing faithful Normalisation and nothing else, and they ran at least ten, and as much as a hundred times faster. In addition, they required only a fraction of the disk space. It is so pedestrian that I guarantee the exercise, in writing.
2.3 Limitation
The limitations, or rather the full extent of 5NF is:
it does not handle optional values, and Nulls have to be used (many designers disallow Nulls and use substitutes, but this has limitations if it not implemented properly and consistently)
you still need to change DDL in order to add or change columns (and there are more and more requirements to add columns that were not initially identified, after implementation; change control is onerous)
although providing the highest level of performance due to Normalisation (read: elimination of duplicates and confused relations), complex queries such as pivoting (producing a report of rows, or summaries of rows, expressed as columns) and "columnar access" as required for data warehouse operations, are difficult, and those operations only, do not perform well. Not that this is due only to the SQL skill level available, and not to the engine.
3 Sixth Normal Form
3.1 Definition
Sixth Normal Form is defined as:
the Relation (row) is the Primary Key plus at most one attribute (column)
It is known as the Irreducible Normal Form, the ultimate NF, because there is no further Normalisation that can be performed. Although it was discussed in academic circles in the mid nineties, it was formally declared only in 2003. For those who like denigrating the formality of the Relational Model, by confusing relations, relvars, "relationships", and the like: all that nonsense can be put to bed because formally, the above definition identifies the Irreducible Relation, sometimes called the Atomic Relation.
3.2 Progression
The increment that 6NF provides (that 5NF does not) is:
formal support for optional values, and thus, elimination of The Null Problem
a side effect is, columns can be added without DDL changes (more later)
effortless pivoting
simple and direct columnar access
it allows for (not in its vanilla form) an even greater level of performance in this department
Let me say that I (and others) were supplying enhanced 5NF tables 20 years ago, explicitly for pivoting, with no problem at all, and thus allowing (a) simple SQL to be used and (b) providing very high performance; it was nice to know that the academic giants of the industry had formally defined what we were doing. Overnight, my 5NF tables were renamed 6NF, without me lifting a finger. Second, we only did this where we needed it; again, it was the table, not the database, that was Normalised to 6NF.
3.3 SQL Limitation
It is a cumbersome language, particularly re joins, and doing anything moderately complex makes it very cumbersome. (It is a separate issue that most coders do not understand or use subqueries.) It supports the structures required for 5NF, but only just. For robust and stable implementations, one must implement additional standards, which may consist in part, of additional catalogue tables. The "use by" date for SQL had well and truly elapsed by the early nineties; it is totally devoid of any support for 6NF tables, and desperately in need of replacement. But that is all we have, so we need to just Deal With It.
For those of us who had been implementing standards and additional catalogue tables, it was not a serious effort to extend our catalogues to provide the capability required to support 6NF structures to standard: which columns belong to which tables, and in what order; mandatory/optional; display format; etc. Essentially a full MetaData catalogue, married to the SQL catalogue.
Note that each NF contains each previous NF within it, so 6NF contains 5NF. We did not break 5NF in order provide 6NF, we provided a progression from 5NF; and where SQL fell short we provided the catalogue. What this means is, basic constraints such as for Foreign Keys; and Value Domains which were provided via SQL Declarative Referential integrity; Datatypes; CHECKS; and RULES, at the 5NF level, remained intact, and these constraints were not subverted. The high quality and high performance of standard-compliant 5NF databases was not reduced in anyway by introducing 6NF.
3.4 Catalogue
It is important to shield the users (any report tool) and the developers, from having to deal with the jump from 5NF to 6NF (it is their job to be app coding geeks, it is my job to be the database geek). Even at 5NF, that was always a design goal for me: a properly Normalised database, with a minimal Data Directory, is in fact quite easy to use, and there was no way I was going to give that up. Keep in mind that due to normal maintenance and expansion, the 6NF structures change over time, new versions of the database are published at regular intervals. Without doubt, the SQL (already cumbersome at 5NF) required to construct a 5NF row from the 6NF tables, is even more cumbersome. Gratefully, that is completely unnecessary.
Since we already had our catalogue, which identified the full 6NF-DDL-that-SQL-does-not-provide, if you will, I wrote a small utility to read the catalogue and:
generate the 6NF table DDL.
generate 5NF VIEWS of the 6NF tables. This allowed the users to remain blissfully unaware of them, and gave them the same capability and performance as they had at 5NF
generate the full SQL (not a template, we have those separately) required to operate against the 6NF structures, which coders then use. They are released from the tedium and repetition which is otherwise demanded, and free to concentrate on the app logic.
I did not write an utility for Pivoting because the complexity present at 5NF is eliminated, and they are now dead simple to write, as with the 5NF-enhanced-for-pivoting. Besides, most report tools provide pivoting, so I only need to provide functions which comprise heavy churning of stats, which needs to be performed on the server before shipment to the client.
3.5 Performance
Everyone has their cross to bear; I happen to be obsessed with Performance. My 5NF databases performed well, so let me assure you that I ran far more benchmarks than were necessary, before placing anything in production. The 6NF database performed exactly the same as the 5NF database, no better, no worse. This is no surprise, because the only thing the 'complex" 6NF SQL does, that the 5NF SQL doesn't, is perform much more joins and subqueries.
You have to examine the myths.
Anyone who has benchmarked the issue (i.e examined the execution plans of queries) will know that Joins Cost Nothing, it is a compile-time resolution, they have no effect at execution time.
Yes, of course, the number of tables joined; the size of the tables being joined; whether indices can be used; the distribution of the keys being joined; etc, all cost something.
But the join itself costs nothing.
A query on five (larger) tables in a Unnormalised database is much slower than the equivalent query on ten (smaller) tables in the same database if it were Normalised. the point is, neither the four nor the nine Joins cost anything; they do not figure in the performance problem; the selected set on each Join does figure in it.
3.6 Benefit
Unrestricted columnar access. This is where 6NF really stands out. The straight columnar access was so fast that there was no need to export the data to a data warehouse in order to obtain speed from specialised DW structures.
My research into a few DWs, by no means complete, shows that they consistently store data by columns, as opposed to rows, which is exactly what 6NF does. I am conservative, so I am not about to make any declarations that 6NF will displace DWs, but in my case it eliminated the need for one.
It would not be fair to compare functions available in 6NF that were unavailable in 5NF (eg. Pivoting), which obviously ran much faster.
That was our first true 6NF database (with a full catalogue, etc; as opposed to the always 5NF with enhancements only as necessary; which later turned out to be 6NF), and the customer is very happy. Of course I was monitoring performance for some time after delivery, and I identified an even faster columnar access method for my next 6NF project. That, when I do it, might present a bit of competition for the DW market. The customer is not ready, and we do not fix that which is not broken.
3.7 What, Exactly, about 6NF, is "Bad" ?
Note that not everyone would approach the job with as much formality, structure, and adherence to standards. So it would be silly to conclude from our project, that all 6NF databases perform well, and are easy to maintain. It would be just as silly to conclude (from looking at the implementations of others) that all 6NF databases perform badly, are hard to maintain; disasters. As always, with any technical endeavour, the resulting performance and ease of maintenance are strictly dependent on formality, structure, and adherence to standards, in addition to the relevant skill set.
4 Entity Attribute Value
Disclosure: Experience. I have inspected a few of these, mostly hospital and medical systems. I have performed corrective assignments on two of them. The initial delivery by the overseas provider was quite adequate, although not great, but the extensions implemented by the local provider were a mess. But not nearly the disaster that people have posted about re EAV on this site. A few months intense work fixed them up nicely.
4.1 What It Is
It was obvious to me that the EAV implementations I have worked on are merely subsets of Sixth Normal Form. Those who implement EAV do so because they want some of the features of 6NF (eg. ability to add columns without DDL changes), but they do not have the academic knowledge to implement true 6NF, or the standards and structures to implement and administer it securely. Even the original provider did not know about 6NF, or that EAV was a subset of 6NF, but they readily agreed when I pointed it out to them. Because the structures required to provide EAV, and indeed 6NF, efficiently and effectively (catalogue; Views; automated code generation) are not formally identified in the EAV community, and are missing from most implementations, I classify EAV as the bastard son Sixth Normal Form.
4.2 What, Exactly, about EAV, is "Bad" ?
Going by the comments in this and other threads, yes, EAV done badly is a disaster. More important (a) they are so bad that the performance provided at 5NF (forget 6NF) is lost and (b) the ordinary isolation from the complexity has not been implemented (coders and users are "forced" to use cumbersome navigation). And if they did not implement a catalogue, all sorts of preventable errors will not have been prevented.
All that may well be true for bad (EAV or other) implementations, but it has nothing to do with 6NF or EAV. The two projects I worked had quite adequate performance (sure, it could be improved; but there was no bad performance due to EAV), and good isolation of complexity. Of course, they were nowhere near the quality or performance of my 5NF databases or my true 6NF database, but they were fair enough, given the level of understanding of the posted issues within the EAV community. They were not the disasters and sub-standard nonsense alleged to be EAV in these pages.
5 Nulls
There is a well-known and documented issue called The Null Problem. It is worthy of an essay by itself. For this post, suffice to say:
the problem is really the optional or missing value; here the consideration is table design such that there are no Nulls vs Nullable columns
actually it does not matter because, regardless of whether you use Nulls/No Nulls/6NF to exclude missing values, you will have to code for that, the problem precisely then, is handling missing values, which cannot be circumvented
except of course for pure 6NF, which eliminates the Null Problem
the coding to handle missing values remains
except, with automated generation of SQL code, heh heh
Nulls are bad news for performance, and many of us have decided decades ago not to allow Nulls in the database (Nulls in passed parameters and result sets, to indicate missing values, is fine)
which means a set of Null Substitutes and boolean columns to indicate missing values
Nulls cause otherwise fixed len columns to be variable len; variable len columns should never be used in indices, because a little 'unpacking' has to be performed on every access of every index entry, during traversal or dive.
6 Position
I am not a proponent of EAV or 6NF, I am a proponent of quality and standards. My position is:
Always, in all ways, do whatever you are doing to the highest standard that you are aware of.
Normalising to Third Normal Form is minimal for a Relational Database (5NF for me). DataTypes, Declarative referential Integrity, Transactions, Normalisation are all essential requirements of a database; if they are missing, it is not a database.
if you have to "denormalise for performance", you have made serious Normalisation errors, your design in not normalised. Period. Do not "denormalise", on the contrary, learn Normalisation and Normalise.
There is no need to do extra work. If your requirement can be fulfilled with 5NF, do not implement more. If you need Optional Values or ability to add columns without DDL changes or the complete elimination of the Null Problem, implement 6NF, only in those tables that need them.
If you do that, due only to the fact that SQL does not provide proper support for 6NF, you will need to implement:
a simple and effective catalogue (column mix-ups and data integrity loss are simply not acceptable)
5NF access for the 6NF tables, via VIEWS, to isolate the users (and developers) from the encumbered (not "complex") SQL
write or buy utilities, so that you can generate the cumbersome SQL to construct the 5NF rows from the 6NF tables, and avoid writing same
measure, monitor, diagnose, and improve. If you have a performance problem, you have made either (a) a Normalisation error or (b) a coding error. Period. Back up a few steps and fix it.
If you decide to go with EAV, recognise it for what it is, 6NF, and implement it properly, as above. If you do, you will have a successful project, guaranteed. If you do not, you will have a dog's breakfast, guaranteed.
6.1 There Ain't No Such Thing As A Free Lunch
That adage has been referred to, but actually it has been misused. The way it actually, deeply applies is as above: if you want the benefits of 6NF/EAV, you had better be willing too do the work required to obtain it (catalogue, standards). Of course, the corollary is, if you don't do the work, you won't get the benefit. There is no "loss" of Datatypes; value Domains; Foreign keys; Checks; Rules. Regarding performance, there is no performance penalty for 6NF/EAV, but there is always a substantial performance penalty for sub-standard work.
7 Specific Question
Finally. With due consideration to the context above, and that it is a small project with a small team, there is no question:
Do not use EAV (or 6NF for that matter)
Do not use Nulls or Nullable columns (unless you wish to subvert performance)
Do use a single Payment table for the common payment columns
and a child table for each PaymentType, each with its specific columns
All fully typecast and constrained.
What's this "another row_id" business ? Why do some of you stick an ID on everything that moves, without checking if it is a deer or an eagle ? No. The child is a dependent child. The Relation is 1::1. The PK of the child is the PK of the parent, the common Payment table. This is an ordinary Supertype-Subtype cluster, the Differentiator is PaymentTypeCode. Subtypes and supertypes are an ordinary part of the Relational Model, and fully catered for in the database, as well as in any good modelling tool.
Sure, people who have no knowledge of Relational databases think they invented it 30 years later, and give it funny new names. Or worse, they knowingly re-label it and claim it as their own. Until some poor sod, with a bit of education and professional pride, exposes the ignorance or the fraud. I do not know which one it is, but it is one of them; I am just stating facts, which are easy to confirm.
A. Responses to Comments
A.1 Attribution
I do not have personal or private or special definitions. All statements regarding the definition (such as imperatives) of:
Normalisation,
Normal Forms, and
the Relational Model.
.
refer to the many original texts By EF Codd and CJ Date (not available free on the web)
.
The latest being Temporal Data and The Relational Model by CJ Date, Hugh Darwen, Nikos A Lorentzos
.
and nothing but those texts
.
"I stand on the shoulders of giants"
.
The essence, the body, all statements regarding the implementation (eg. subjective, and first person) of the above are based on experience; implementing the above principles and concepts, as a commercial organisation (salaried consultant or running a consultancy), in large financial institutions in America and Australia, over 32 years.
This includes scores of large assignments correcting or replacing sub-standard or non-relational implementations.
.
The Null Problem vis-a-vis Sixth Normal Form
A freely available White Paper relating to the title (it does not define The Null Problem alone) can be found at:
http://www.dcs.warwick.ac.uk/~hugh/TTM/Missing-info-without-nulls.pdf.
.
A 'nutshell' definition of 6NF (meaningful to those experienced with the other NFs), can be found on p6
A.2 Supporting Evidence
As stated at the outset, the purpose of this post is to counter the misinformation that is rife in this community, as a service to the community.
Evidence supporting statements made re the implementation of the above principles, can be provided, if and when specific statements are identified; and to the same degree that the incorrect statements posted by others, to which this post is a response, is likewise evidenced. If there is going to be a bun fight, let's make sure the playing field is level
Here are a few docs that I can lay my hands on immediately.
a. Large Bank
This is the best example, as it was undertaken for explicitly the reasons in this post, and goals were realised. They had a budget for Sybase IQ (DW product) but the reports were so fast when we finished the project, they did not need it. The trade analytical stats were my 5NF plus pivoting extensions which turned out to be 6NF, described above. I think all the questions asked in the comments have been answered in the doc, except:
- number of rows:
- old database is unknown, but it can be extrapolated from the other stats
- new database = 20 tables over 100M, 4 tables over 10B.
b. Small Financial Institute Part A
Part B - The meat
Part C - Referenced Diagrams
Part D - Appendix, Audit of Indices Before/After (1 line per Index)
Note four docs; the fourth only for those who wish to inspect detailed Index changes. They were running a 3rd party app that could not be changed because the local supplier was out of business, plus 120% extensions which they could, but did not want to, change. We were called in because they upgraded to a new version of Sybase, which was much faster, which shifted the various performance thresholds, which caused large no of deadlocks. Here we Normalised absolutely everything in the server except the db model, with the goal (guaranteed beforehand) of eliminating deadlocks (sorry, I am not going to explain that here: people who argue about the "denormalisation" issue, will be in a pink fit about this one). It included a reversal of "splitting tables into an archive db for performance", which is the subject of another post (yes, the new single table performed faster than the two spilt ones). This exercise applies to MS SQL Server [insert rewrite version] as well.
c. Yale New Haven Hospital
That's Yale School of Medicine, their teaching hospital. This is a third-party app on top of Sybase. The problem with stats is, 80% of the time they were collecting snapshots at nominated test times only, but no consistent history, so there is no "before image" to compare our new consistent stats with. I do not know of any other company who can get Unix and Sybase internal stats on the same graphs, in an automated manner. Now the network is the threshold (which is a Good Thing).
Perhaps you should look this question
The accepted answer from Bill Karwin goes into specific arguments against the key/value table usually know as Entity Attribute Value (EAV)
.. Although many people seem to favor
EAV, I don't. It seems like the most
flexible solution, and therefore the
best. However, keep in mind the adage
TANSTAAFL. Here are some of the
disadvantages of EAV:
No way to make a column mandatory (equivalent of NOT NULL).
No way to use SQL data types to validate entries.
No way to ensure that attribute names are spelled consistently.
No way to put a foreign key on the values of any given attribute, e.g.
for a lookup table.
Fetching results in a conventional tabular layout is complex and
expensive, because to get attributes
from multiple rows you need to do
JOIN for each attribute.
The degree of flexibility EAV gives
you requires sacrifices in other
areas, probably making your code as
complex (or worse) than it would have
been to solve the original problem in
a more conventional way.
And in most cases, it's an unnecessary
to have that degree of flexibility.
In the OP's question about product
types, it's much simpler to create a
table per product type for
product-specific attributes, so you
have some consistent structure
enforced at least for entries of the
same product type.
I'd use EAV only if every row must
be permitted to potentially have a
distinct set of attributes. When you
have a finite set of product types,
EAV is overkill. Class Table
Inheritance would be my first choice.
My #1 principle is not to redesign something for no reason. So I would go with option 1 because that's your current design and it has a proven track record of working.
Spend the redesign time on new features instead.
If I were designing from scratch I would go with number two. It gives you the flexibility you need. However with number 1 already in place and working and this being soemting rather central to your whole app, i would probably be wary of making a major design change without a good idea of exactly what queries, stored procs, views, UDFs, reports, imports etc you would have to change. If it was something I could do with a relatively low risk (and agood testing alrady in place.) I might go for the change to solution 2 otherwise you might beintroducing new worse bugs.
Under no circumstances would I use an EAV table for something like this. They are horrible for querying and performance and the flexibility is way overrated (ask users if they prefer to be able to add new types 3-4 times a year without a program change at the cost of everyday performance).
At first sight, I would go for option 2 (or 3): when possible, generalize.
Option 4 is not very Relational I think, and will make your queries complex.
When confronted to those question, I generally confront those options with "use cases":
-how is design 2/3 behaving when do this or this operation ?

What makes Cassandra (and NoSQL in general) a better solution to an RDBMS?

Well, NoSQL is a buzzword right now so I've been looking into it. I'm yet to get my head around ColumnFamilies and SuperColumns, etc... But I have been looking at how the data is mapped.
After reading this article, and others, it seems the data is mapped in a JSON like format.
Users = {
1: {
username: "dave",
password: "blahblah",
dateReged: "1/1/1"
},
2: {
username: "etc",
password: "blahblah",
dateReged: "2/1/1",
comment: "this guy has a comment and dave doesns't"
},
}
The RDBMS format would be:
Table name: "Users"
id | username | password | dateReged | comment
---+----------+----------+-----------+--------
1 | dave | blahblah | 1/1/1 |
---+----------+----------+-----------+--------
2 | etc | blahblah | 2/1/1 | this guy has a comment and dave doesn't
Assuming I understand this correctly and my above examples are right, why would I choose the RDBMS design over the NoSQL design? Personally, I'd much rather work with the JSON structure... Does this mean I should choose NoSQL over, say, MySQL?
I guess what I'm asking is "when should I choose NoSQL over RDBMS?"
On a side note, as I've said, I'm still not fully understanding how to go about implementing a Cassandra database. Ie, how do I create the above Users table in a new database? Any tutorials, documentation, etc you could point to would be great. My google'ing hasn't turned up much in terms of 'starting from scratch'...
If you are google, then you might be in a position where a NoSQL would be easier on you than a RDBMS. Since you are not, the many advantages an RDBMS provides you will probably be of some use. Significantly, on a single node, NoSQL offers absolutely no advantages over RDBMSes. RDBMSes offer lots of advantages over NoSQL, though. what are they?
RDBMSes use some pretty deep magic to understand the data it owns, and the data you are asking for, in such a way that it can return that data in the most efficient manner possible. If you didn't ask about some column, the rdbms doesn't waste any effort retrieving it. If you are interested in rows that have fields in common across two tables, (this is a join, btw), the RDBMS doesn't have to check every single pair of rows for matches, or what a NoSQL db usually does is just give you everything and make you do the checking. with a RDBMS, you can usually construct queries that are actually 'about' the data you are using, like "if the date is a tuesday", and if your indexes support it (if you do that query alot then you would add such an index) you can get those rows efficiently.
There is another reason why RDBMSes are nice. Transactions are easy on RDBMSes, but are much harder to get right on NoSQL databases. Supposing you are implementing a blogging engine. Suppose the post title (which appears in the URL) needs to be unique across all posts. In an RDBMS, you can easily be sure that you won't get this wrong accidentally. With a NoSQL database, if it does support some kind of transactional integrity, it's usually at the shard level, anything that could possibly require that kind of integrity must be on the same shard. since any pair of users could possibly be posting at the same moment, then every users' post must be on the same shard to get the same effect. Well, then you don't get any benefit at all from NoSQL.
The main advantage of NoSQL is horizontal scalability and distributed storage. That means you can have a large number of 'cluster nodes' and write to them in parallel. The cluster will ensure changes are propagated to the other cluster nodes eventually (eventual consistency).
NoSQL is not so much about SQL (the term means "not only SQL"). In fact, some NoSQL products do support a subset of SQL. The reason the data format is different (JSON or list of property / value pairs versus tabular data) is: within relational databases, the number of columns (and column names) is defined in a central place, which doesn't work well with horizontal scalability (you would need to stop all cluster nodes for schema changes). Also, joins are not supported as much because that would break horizontal scalability (data from multiple cluster nodes may need to be read, if the data is distributed).
NoSQl databases are fine for some websites where you don't need transaction or consistency where all you are doing is presenting some data (but until you get really really large, they are not really very needed).
But if you need to enforce financial rules (or other complex data integrity rules) or internal controls or reporting and aggregating data for reporting, you need an RDBMS. I'll bet even Google uses RDBMS' for their own HR and financial data, etc.
For some web applications, you might even want a combination of both, the nosql database for some types of information, the transactional relational database for orders and other things where transactional consistency is a must.
If you develop web sites, I think you need to thoroughly understand both types of databases and the needs behind them before choosing how to handle any new functionality.
It seems to me that you have almost no knowledge of relational databases and would rather do what is easier for you personally than what is right for the project. Maybe I'm not reading that correctly, but anyone who never uses joins is suspect in terms of understanding relational databases.
You don't decide between these two based on which one seems easier to understand or which is the buzzword of the month, you decide them based on the functionality you will need, not just for the user interface but for administrative tasks, reporting, financial or other types of data auditing, government regulation, data recovery in case of a hardware failure, etc.
RDBMS' are all about consistency. They do a great job on data that gets churned alot with transactions. See also ACID (atomicity, consistency, isolation, durability). Sometimes you don't need all that, like when storing data from logs or working on data that's not going to change, just accumulate.
NoSQL databases let you relax the requirements for transactions and get better performance (as well as scale to large distributed storage silos easier).
The advantage fo NoSql is that its simpler and if you have your OO blinkers on it fullfills all your persistence needs.
The advantage of SQL based realtional database is that you can easily re-use and extend your data in ways that were not envisaged in the original design. Also "Object" databases tend to perform very badly (even if its possable) when you want to do the equivalent of SQLs aggregate queries like COUNT, SUM, AVG.
Googles BIGTABLE which is the biggest OO database anywhere (and probably the biggest database period) also supports SQL and sql features like indexing and strong typing.
Answer is easy. If you need data storage - use NoSQL, if you need more features then just storing data - use RDBMS.
I guess what I'm asking is "when should I choose NoSQL over RDBMS?"
[Caveat: I've never read about NoSQL before]
According to Wikipedia, NoSQL isn't good at joins: which implies (to me) no referential integrity and no normalization.
As many books about NoSQL mention, it's not about which database is better than the other. It's more what you need.
As everyone say in the other answers, many NoSQL databases support horizontal scalability and are focused on high availability but they are not always the best fit for your needs.
for example, Cassandra is great to add or remove nodes from a cluster, allowing that high scalability. But when you compare Cassandra with MySQL in an environment with just one node (one server), and with no distributed architecture, there isn't a lot of different, since the main advantages of Cassandra are not used.
Now, why should you use SQL? The most common reason is transaction management. Currently, no popular NoSQL database natively supports transactions. You can emulate them, but they are not part of the native functionality as in most SQL databases.
For Cassandra, there is a full and free training in https://academy.datastax.com
There you won't only find trainings to install and configure Cassandra, but to use its tools. It even gives you completion certificates.
Datastax has its own distribution of Cassandra, but it follows all the same guidelines as the Apache project; it offers some extra tools.
The simplest answer I can think of is: When your data doesn't fit a relational model.
I gave a talk at OSCON about when NoSQL can be the right choice, and some of the different sub-categories to be aware of: http://assets.en.oreilly.com/1/event/45/The%20NoSQL%20Ecosystem%20Presentation.pdf
Cassandra in and of itself is not better than an RDBMS. It is better under some circumstances. An RDBMS is vastly superior for transaction processing, master data management, reference data, data warehousing and (some forms of) BI.
Use NOSQL if your application requires a flexible schema, variable length rows, variable types of columns, eventual integrity, horizonal scalability on commodity servers, and high availability achieved by means of a distributed architecture.
NOSQL does not do joins for several reasons: you already joined the data before the NOSQL file was loaded so there is no need to; because a distributed join over far-reaching servers would be resource intensive. The first reason above is simple: you have embedded all the data you need into a single structure. If you do not embed the data and have to link, don't expect great performance out of it. Linking is a euphemism for application-provided joining without the benefit of consolidating the data as a join does. Assuming hashing a key is the method of data distribution, different records that have the same hash key would be collocated. Thereby if joining were permitted, the joined data would all be on the same server.
It's not just black and white.

NoSQL vs. SQL when scalability is irrelevant

Recently I have read a lot about different NoSQL databases and how they are being effectively deployed by some major websites out there. I'm starting a project in which I think the schema-free nature of a database such as MongoDB would be tremendously useful. Everything I have read though seems to indicate that the main advantage of a NoSQL database is scalability. Is choosing a NoSQL database for the schema-free design just as legitimate a design decision as that of scalability?
Yes, sometimes RDBMS are not the best solution, although there are ways to accomodate user defined fields (see XML Datatype, EAV design pattern, or just have spare generic columns) sometimes a schema free database is a good choice.
However, you need to nail down your requirements before choosing to go with a document database, as you will loose a lot of the power you may be used to with the relational model
eg...
If you would otherwise have multiple tables in your RDBMS database, you will need to research the features MongoDB affords you to accomodate these needs.
If you will need to query the data in specific ways, again you need to research what MongoDB offers you.
I wouldnt think of NoSQL as replacement for RDBMS, rather a slightly different tool that brings its own sets of advantages and disadvantages making it more suitable for some projects than others.
(Both databases may be used in some circumstances. Also if you decide to go down the route of possibly using MongoDB, once you have researched the websites out there and have more specific questions, you can visit Freenode IRC #mongodb channel)
There are a lot of other conditions that I've been hearing about with non-relational systems vs relational. I prefer this terminology over sql/no-sql as I personally think it describes the differences better, and several of the "no-sql" servers have sql add-ons, so anyway.... what sort of concurrency pattern or tranaction isolation is required in your system. One of the purported differences between rel and non-rel dbs is the "consistent-always", "consistent-mostly" or "consistent-eventually". Relation dbs by default usually fall into the "consistent-mostly" category and with some work, and a whole lot of locking and race conditions, ;) can be "consistent-always" so everyone is always looking at the most correct representation of a given piece of data. Most of what I've read/heard about non-rel dbs is that they are mainly "consistent-eventually". By this it means that there may be many instances of our data floating around, so user "A" may see that we have 92 widgets in inventory, whereas user "B" may see 79, and they may not get reconciled until someone actually goes to pull stuff from the warehouse. Another issue is mutability of data, how often does it need to be updated? The particular non-rel db's I've been exposed to have more overhead for updates, some of them having to regenerate the entire dataset to incorporate any updates.
Now mind, I think non-rel/nosql are great tools if they really match your use case. I've got several I'm looking into now for projects I've got. But you've got to look at all the trade offs when making the decision, otherwise it just turns into more resume driven development.
I don't think you should choose NoSQL datastore for its schema free design. Schema free design always existed in RDBMS via XML and some databases have good XML support. It is a lot easier to deal with a database than a NoSQL datastore. Scalability and big data should be the primary drivers to choose a NoSQL datastore otherwise the tradeoff of ACID and SQL is a lot to switch to NoSQL.
the most important things should be noticed to distinguish between No-SQL and SQL
which is :
NO-SQL useful when data base scales in a huge manner like social network
for example :
Stack Overflow: each question has multiple answers and not imaginary an answer without question, so No-SQL will ensure that each question include it's answers
as a result when needing getting answers of a question we can bring all answers without joining.Because join is the most expensive query in related database
thanks alot
what raised this issue that if you have a large server farm and need to manage the distribution of your data and load balancing which is more difficult and harder to implement using RDBMS and requires high IT skills to design, plan and deploy for your solution (and still performance is less).
but if you have only 3 or 4 servers with small project. I don't think you have an issue about it. NoSQL database is usually considered in large server farms not small number of servers