Best Database for Computational Problem - sql

I'm trying to decide what database system to use for storing information that is relatively static but needs to be computed in a number of different (runtime specified) ways. The basic contours of the data are votes in the US Congress:
A bill:
has many roll calls
has a name, and other short metadata
has text, and other potentially long metadata
has a status (passed, failed, in progress)
A roll call:
has a date
has many votes
has a status (passed, failed)
A vote:
belongs to a member of Congress
has a kind (aye, nay, present, not voting)
A member of congress:
has a name (and other short metadata)
has many periods
A period:
has a start and end date
has a political party (Democrat, Republican, other)
has a position (member of Congress, committee chair, Speaker, etc.)
I would like to be able to easily build queries like:
For X, Y, and Z roll call votes, tell me the "Democratic" position and the "Republican" position. Then, rank congressmen in the congress those votes were held by their fidelity to those positions.
For X bill which failed, tell me the closest roll calls. Then, tell me which members of the majority party defected to produce those failures.
For X bill passed, but which was opposed by the majority party, tell me which members of the majority defected to produce the passage.
I will have a finite number of query types like these, but the bills, roll call votes, political parties, etc. involved will be dynamically generated.
What is the best storage mechanism for the underlying data that will allow me to issue these queries dynamically and as performantly as possible?

This looks like pretty standard relational data to me. Any RDBMS (MySQL, SqlServer, postgres, etc.) will do.
Or are you asking advice on how to make tables to store this data?

You could use just about any database, until I read:
...rank congressmen...
MySQL doesn't have any ranking functionality. I'm not clear on Postgres' ranking support, but Oracle and SQL Server have supported ranking for a while now (Oracle 9i+, SQL Server 2005+). And they both provide free versions.

Storage mechanism? Any mainstream database should be capable of dealing with the kind of scenario you are describing. Looks pretty standard stuff to me.

As others have stated - any relational database can support a simple model to solve this problem. However, a few other considerations:
This is an analytical and not transactional app and the commercial databases are currently stronger at analytics - because of the more mature optimizers, greater sql functionality, greater support for parallelism, materialized queries, automatic query rewrite against summary tables, etc.
If you just stick with the US congress and don't decide to also support state congresses and don't also decide to add a hundred years of historical data (all useful requirements), then pretty much any popular relational database could handle the performance issues. But if you do decide to get into the state level then I'd consider the commercial databases first.
Of the open source databases I'd consider the analytical functionality of postgresql to be the most mature.

Here's where I would usually chime in and say, use CouchDB or some other schema-free NOSQL database. But the way the problem is spec'd lays out nicely for a relational store. Plus, there's not a terribly large amount of data that would require distributed processing a la mapreduce.
That being said, if the question was framed a bit differently, without the initial relational bias (you're already in data design mode :) ), then a system like CouchDB could work. Depending on the analyses to be performed, a more document-centered approach might be helpful, as all the information needed for an analysis is present on each document (denormalized) and would avoid expensive joins.
Each bill might be one of these docs (json in CouchDB's case), and the rollcalls/votes/congress members with periods as sub-attribs/etc are all on the one 'bill' document. You could then mapreduce over all of the 'bill' documents performing your queries. A different document-oriented design might make sense depending on query requirements.
As the data set grows, you're not worried about size/performance, because you can always use more servers to perform mapreduce queries and distribute the load. Further, schemaless means documents can change as your app changes, without expensive rdbms table locking. But again, this data set doesn't change terribly often, and is not massive.

Related

What is NoSQL? How it works? [duplicate]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
I've been hearing things about NoSQL and that it may eventually become the replacement for SQL DB storage methods due to the fact that DB interaction is often a bottle neck for speed on the web.
So I just have a few questions:
What exactly is it?
How does it work?
Why would it be better than using a SQL Database? And how much better is it?
Is the technology too new to start implementing yet or is it worth taking a look into?
There is no such thing as NoSQL!
NoSQL is a buzzword.
For decades, when people were talking about databases, they meant relational databases. And when people were talking about relational databases, they meant those you control with Edgar F. Codd's Structured Query Language. Storing data in some other way? Madness! Anything else is just flatfiles.
But in the past few years, people started to question this dogma. People wondered if tables with rows and columns are really the only way to represent data. People started thinking and coding, and came up with many new concepts how data could be organized. And they started to create new database systems designed for these new ways of working with data.
The philosophies of all these databases were different. But one thing all these databases had in common, was that the Structured Query Language was no longer a good fit for using them. So each database replaced SQL with their own query languages. And so the term NoSQL was born, as a label for all database technologies which defy the classic relational database model.
So what do NoSQL databases have in common?
Actually, not much.
You often hear phrases like:
NoSQL is scalable!
NoSQL is for BigData!
NoSQL violates ACID!
NoSQL is a glorified key/value store!
Is that true? Well, some of these statements might be true for some databases commonly called NoSQL, but every single one is also false for at least one other. Actually, the only thing NoSQL databases have in common, is that they are databases which do not use SQL. That's it. The only thing that defines them is what sets them apart from each other.
So what sets NoSQL databases apart?
So we made clear that all those databases commonly referred to as NoSQL are too different to evaluate them together. Each of them needs to be evaluated separately to decide if they are a good fit to solve a specific problem. But where do we begin? Thankfully, NoSQL databases can be grouped into certain categories, which are suitable for different use-cases:
Document-oriented
Examples: MongoDB, CouchDB
Strengths: Heterogenous data, working object-oriented, agile development
Their advantage is that they do not require a consistent data structure. They are useful when your requirements and thus your database layout changes constantly, or when you are dealing with datasets which belong together but still look very differently. When you have a lot of tables with two columns called "key" and "value", then these might be worth looking into.
Graph databases
Examples: Neo4j, GiraffeDB.
Strengths: Data Mining
While most NoSQL databases abandon the concept of managing data relations, these databases embrace it even more than those so-called relational databases.
Their focus is at defining data by its relation to other data. When you have a lot of tables with primary keys which are the primary keys of two other tables (and maybe some data describing the relation between them), then these might be something for you.
Key-Value Stores
Examples: Redis, Cassandra, MemcacheDB
Strengths: Fast lookup of values by known keys
They are very simplistic, but that makes them fast and easy to use. When you have no need for stored procedures, constraints, triggers and all those advanced database features and you just want fast storage and retrieval of your data, then those are for you.
Unfortunately they assume that you know exactly what you are looking for. You need the profile of User157641? No problem, will only take microseconds. But what when you want the names of all users who are aged between 16 and 24, have "waffles" as their favorite food and logged in in the last 24 hours? Tough luck. When you don't have a definite and unique key for a specific result, you can't get it out of your K-V store that easily.
Is SQL obsolete?
Some NoSQL proponents claim that their favorite NoSQL database is the new way of doing things, and SQL is a thing of the past.
Are they right?
No, of course they aren't. While there are problems SQL isn't suitable for, it still got its strengths. Lots of data models are simply best represented as a collection of tables which reference each other. Especially because most database programmers were trained for decades to think of data in a relational way, and trying to press this mindset onto a new technology which wasn't made for it rarely ends well.
NoSQL databases aren't a replacement for SQL - they are an alternative.
Most software ecosystems around the different NoSQL databases aren't as mature yet. While there are advances, you still haven't got supplemental tools which are as mature and powerful as those available for popular SQL databases.
Also, there is much more know-how for SQL around. Generations of computer scientists have spent decades of their careers into research focusing on relational databases, and it shows: The literature written about SQL databases and relational data modelling, both practical and theoretical, could fill multiple libraries full of books. How to build a relational database for your data is a topic so well-researched it's hard to find a corner case where there isn't a generally accepted by-the-book best practice.
Most NoSQL databases, on the other hand, are still in their infancy. We are still figuring out the best way to use them.
What exactly is it?
On one hand, a specific system, but it has also become a generic word for a variety of new data storage backends that do not follow the relational DB model.
How does it work?
Each of the systems labelled with the generic name works differently, but the basic idea is to offer better scalability and performance by using DB models that don't support all the functionality of a generic RDBMS, but still enough functionality to be useful. In a way it's like MySQL, which at one time lacked support for transactions but, exactly because of that, managed to outperform other DB systems. If you could write your app in a way that didn't require transactions, it was great.
Why would it be better than using a SQL Database? And how much better is it?
It would be better when your site needs to scale so massively that the best RDBMS running on the best hardware you can afford and optimized as much as possible simply can't keep up with the load. How much better it is depends on the specific use case (lots of update activity combined with lots of joins is very hard on "traditional" RDBMSs) - could well be a factor of 1000 in extreme cases.
Is the technology too new to start implementing yet or is it worth taking a look into?
Depends mainly on what you're trying to achieve. It's certainly mature enough to use. But few applications really need to scale that massively. For most, a traditional RDBMS is sufficient. However, with internet usage becoming more ubiquitous all the time, it's quite likely that applications that do will become more common (though probably not dominant).
Since someone said that my previous post was off-topic, I'll try to compensate :-) NoSQL is not, and never was, intended to be a replacement for more mainstream SQL databases, but a couple of words are in order to get things in the right perspective.
At the very heart of the NoSQL philosophy lies the consideration that, possibly for commercial and portability reasons, SQL engines tend to disregard the tremendous power of the UNIX operating system and its derivatives.
With a filesystem-based database, you can take immediate advantage of the ever-increasing capabilities and power of the underlying operating system, which have been steadily increasing for many years now in accordance with Moore's law. With this approach, many operating-system commands become automatically also "database operators" (think of "ls" "sort", "find" and the other countless UNIX shell utilities).
With this in mind, and a bit of creativity, you can indeed devise a filesystem-based database that is able to overcome the limitations of many common SQL engines, at least for specific usage patterns, which is the whole point behind NoSQL's philosophy, the way I see it.
I run hundreds of web sites and they all use NoSQL to a greater or lesser extent. In fact, they do not host huge amounts of data, but even if some of them did I could probably think of a creative use of NoSQL and the filesystem to overcome any bottlenecks. Something that would likely be more difficult with traditional SQL "jails". I urge you to google for "unix", "manis" and "shaffer" to understand what I mean.
If I recall correctly, it refers to types of databases that don't necessarily follow the relational form. Document databases come to mind, databases without a specific structure, and which don't use SQL as a specific query language.
It's generally better suited to web applications that rely on performance of the database, and don't need more advanced features of Relation Database Engines. For example, a Key->Value store providing a simple query by id interface might be 10-100x faster than the corresponding SQL server implementation, with a lower developer maintenance cost.
One example is this paper for an OLTP Tuple Store, which sacrificed transactions for single threaded processing (no concurrency problem because no concurrency allowed), and kept all data in memory; achieving 10-100x better performance as compared to a similar RDBMS driven system. Basically, it's moving away from the 'One Size Fits All' view of SQL and database systems.
In practice, NoSQL is a database system which supports fast access to large binary objects (docs, jpgs etc) using a key based access strategy. This is a departure from the traditional SQL access which is only good enough for alphanumeric values. Not only the internal storage and access strategy but also the syntax and limitations on the display format restricts the traditional SQL. BLOB implementations of traditional relational databases too suffer from these restrictions.
Behind the scene it is an indirect admission of the failure of the SQL model to support any form of OLTP or support for new dataformats. "Support" means not just store but full access capabilities - programmatic and querywise using the standard model.
Relational enthusiasts were quick to modify the defnition of NoSQL from Not-SQL to Not-Only-SQL to keep SQL still in the picture! This is not good especially when we see that most Java programs today resort to ORM mapping of the underlying relational model. A new concept must have a clearcut definition. Else it will end up like SOA.
The basis of the NoSQL systems lies in the random key - value pair. But this is not new. Traditional database systems like IMS and IDMS did support hashed ramdom keys (without making use of any index) and they still do. In fact IDMS already has a keyword NONSQL where they support SQL access to their older network database which they termed as NONSQL.
It's like Jacuzzi: both a brand and a generic name. It's not just a specific technology, but rather a specific type of technology, in this case referring to large-scale (often sparse) "databases" like Google's BigTable or CouchDB.
NoSQL the actual program appears to be a relational database implemented in awk using flat files on the backend. Though they profess, "NoSQL essentially has no arbitrary limits, and can work where other products can't. For example there is no limit on data field size, the number of columns, or file size" , I don't think it is the large scale database of the future.
As Joel says, massively scalable databases like BigTable or HBase, are much more interesting. GQL is the query language associated with BigTable and App Engine. It's largely SQL tweaked to avoid features Google considers bottle-necks (like joins). However, I haven't heard this referred to as "NoSQL" before.
NoSQL is a database system which doesn't use string based SQL queries to fetch data.
Instead you build queries using an API they will provide, for example Amazon DynamoDB is a good example of a NoSQL database.
NoSQL databases are better for large applications where scalability is important.
Does NoSQL mean non-relational database?
Yes, NoSQL is different from RDBMS and OLAP. It uses looser consistency models than traditional relational databases.
Consistency models are used in distributed systems like distributed shared memory systems or distributed data store.
How it works internally?
NoSQL database systems are often highly optimized for retrieval and appending operations and often offer little functionality beyond record storage (e.g. key-value stores). The reduced run-time flexibility compared to full SQL systems is compensated by marked gains in scalability and performance for certain data models.
It can work on Structured and Unstructured Data. It uses Collections instead of Tables
How do you query such "database"?
Watch SQL vs NoSQL: Battle of the Backends; it explains it all.

SQL versus noSQL (speed)

When people are comparing SQL and noSQL, and concluding the upsides and downsides of each one, what I never hear anyone talking about is the speed.
Isn't performing SQL queries generally faster than performing noSQL queries?
I mean, for me this would be a really obvious conclusion, because you should always be able to find something faster if you know the structure of your database than if you don't.
But people never seem to mention this, so I want to know if my conclusion is right or wrong.
People who tend to use noSQL use it specifically because it fits their use cases. Being divorced from normal RDBMS table relationships and constraints, as well as ACID-ity of data, it's very easy to make it run a lot faster.
Consider Twitter, which uses NoSQL because a user only does very limited things on site, or one exactly - tweet. And concurrency can be considered non-existent since (1) nobody else can modify your tweet and (2) you won't normally be simultaneously tweeting from multiple devices.
The definition of noSQL systems is a very broad one -- a database that doesn't use SQL / is not a RDBMS.
Therefore, the answer to your question is, in short: "it depends".
Some noSQL systems are basically just persistent key/value storages (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will (or at least should be) faster that an RDBMS, because it only needs to have a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB).
These databases have no predefined data structure.
Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
As Einstein would say, speed is relative.
If you need to store a master/detail simple application (like a shopping cart), you would need to do several Insert statements in your SQL application, also you will get a Data set of information when you do a query to get the purchase, if you're using NoSQL, and you're using it well, then you would have all the data for a single order in one simple "record" (document if you use the terms of NoSQL databases like djondb).
So, I really think that the performance of an application can be measured by the number of things it need to do to achieve a single requirement, if you need to do several Inserts to store an order and you only need one simple Insert in a database like djondb then the performance will be 10x faster in the NoSQL world, just because you're using 10 times less calls to the database layer, that's it.
To illustrate my point let me link an example I wrote sometime ago about the differences between NoSQL and SQL data models approach: https://web.archive.org/web/20160510045647/http://djondb.com/blog/nosql-masterdetail-sample/, I know it's a self reference, but basically I wrote it to address this question which I found it's the most challenging question a RDBMS guy could have and it's always a good way to explain why NoSQL is so different from SQL world, and why it will achieve better performance anytime, not because we use "nasa" technology, it's because NoSQL will let the developer do less... and get more, and less code = greater performance.
The answer is: it depends. Generally speaking, the objective of NoSQL DATABASES (no "queries") is scalability. RDBMS usually have some hard limits at some point (I'm talking about millons and millons of rows) where you could not scale any more by traditional means (Replication, clustering, partitioning), and you need something more because your needs keep growing. Or even if you manage to scale, the overall setup is quite complicated. Or you can scale reads, but not writes.
And the queries depends on the particular implementation of your server, the type of query you are doing, the columns in the table, etc... remember that queries are just one part of the RDBMS.
query time of relational database like SQL for 1000 person data is 2000 ms and graph database like neo4j
is 2ms .if you crate more node 1000000 speed stable 2 ms

Advice for hand-written olap-like extractions from relational database

We've implemented over the course of the years a series of web based reports summarizing historical business data (product sales, traffic, etc). The thing relies heavily on complex SQL queries, and the boss expects the results to be real time, but they need up to a minute to execute. The reports are customizable on a several dimensions.
I've done some basic research, and it looks like what we need is some kind of OLAP (?), ETL(?), whatever.
Is that true? Are we supposed to convert to a whole package and trash our beloved developments, or is there a possibility to keep it relational, SQL-based, and get close to a dedicated solution by simply pre-calculating some optimized views with a batch process running at night? Have you got pointers to good documentation on the subject?
Thank you.
You can do ETL (Extract, transform, and load) at night, loading the (probably summarized) data into tables that can usually be queried pretty quickly. Appropriate indexes are still important.
It often makes sense to put those summary tables in a different schema, a different database, or on a different server, but you don't absolutely have to do that.
The structure of the tables is important, and it's not like designing tables for an OLTP system. The IBM Redbooks have a couple of titles that can help you design the tables.
Data Modeling Techniques for Data
Warehousing
Dimensional Modeling: In a Business
Intelligence Environment
Most dbms today support SQL analytic functions. See, for example, Analytic Functions by Example for Oracle, or Window Functions for PostgreSQL.
In the long term, it sounds as though a move to a data warehouse would definitely benefit you (as suggested in Catcall's answer). You can use the existing reports as a starting point for your data warehouse's requirements.
In the short term, you could build summarised tables optimised for your existing reporting requirements. This should probably be regarded as a stopgap, unless you are never going to change these reports again.
You might also benefit from looking into partitioning tables in your database by date/time, since you will probably still want to report the current day's data for realtime reporting purposes.

What makes Cassandra (and NoSQL in general) a better solution to an RDBMS?

Well, NoSQL is a buzzword right now so I've been looking into it. I'm yet to get my head around ColumnFamilies and SuperColumns, etc... But I have been looking at how the data is mapped.
After reading this article, and others, it seems the data is mapped in a JSON like format.
Users = {
1: {
username: "dave",
password: "blahblah",
dateReged: "1/1/1"
},
2: {
username: "etc",
password: "blahblah",
dateReged: "2/1/1",
comment: "this guy has a comment and dave doesns't"
},
}
The RDBMS format would be:
Table name: "Users"
id | username | password | dateReged | comment
---+----------+----------+-----------+--------
1 | dave | blahblah | 1/1/1 |
---+----------+----------+-----------+--------
2 | etc | blahblah | 2/1/1 | this guy has a comment and dave doesn't
Assuming I understand this correctly and my above examples are right, why would I choose the RDBMS design over the NoSQL design? Personally, I'd much rather work with the JSON structure... Does this mean I should choose NoSQL over, say, MySQL?
I guess what I'm asking is "when should I choose NoSQL over RDBMS?"
On a side note, as I've said, I'm still not fully understanding how to go about implementing a Cassandra database. Ie, how do I create the above Users table in a new database? Any tutorials, documentation, etc you could point to would be great. My google'ing hasn't turned up much in terms of 'starting from scratch'...
If you are google, then you might be in a position where a NoSQL would be easier on you than a RDBMS. Since you are not, the many advantages an RDBMS provides you will probably be of some use. Significantly, on a single node, NoSQL offers absolutely no advantages over RDBMSes. RDBMSes offer lots of advantages over NoSQL, though. what are they?
RDBMSes use some pretty deep magic to understand the data it owns, and the data you are asking for, in such a way that it can return that data in the most efficient manner possible. If you didn't ask about some column, the rdbms doesn't waste any effort retrieving it. If you are interested in rows that have fields in common across two tables, (this is a join, btw), the RDBMS doesn't have to check every single pair of rows for matches, or what a NoSQL db usually does is just give you everything and make you do the checking. with a RDBMS, you can usually construct queries that are actually 'about' the data you are using, like "if the date is a tuesday", and if your indexes support it (if you do that query alot then you would add such an index) you can get those rows efficiently.
There is another reason why RDBMSes are nice. Transactions are easy on RDBMSes, but are much harder to get right on NoSQL databases. Supposing you are implementing a blogging engine. Suppose the post title (which appears in the URL) needs to be unique across all posts. In an RDBMS, you can easily be sure that you won't get this wrong accidentally. With a NoSQL database, if it does support some kind of transactional integrity, it's usually at the shard level, anything that could possibly require that kind of integrity must be on the same shard. since any pair of users could possibly be posting at the same moment, then every users' post must be on the same shard to get the same effect. Well, then you don't get any benefit at all from NoSQL.
The main advantage of NoSQL is horizontal scalability and distributed storage. That means you can have a large number of 'cluster nodes' and write to them in parallel. The cluster will ensure changes are propagated to the other cluster nodes eventually (eventual consistency).
NoSQL is not so much about SQL (the term means "not only SQL"). In fact, some NoSQL products do support a subset of SQL. The reason the data format is different (JSON or list of property / value pairs versus tabular data) is: within relational databases, the number of columns (and column names) is defined in a central place, which doesn't work well with horizontal scalability (you would need to stop all cluster nodes for schema changes). Also, joins are not supported as much because that would break horizontal scalability (data from multiple cluster nodes may need to be read, if the data is distributed).
NoSQl databases are fine for some websites where you don't need transaction or consistency where all you are doing is presenting some data (but until you get really really large, they are not really very needed).
But if you need to enforce financial rules (or other complex data integrity rules) or internal controls or reporting and aggregating data for reporting, you need an RDBMS. I'll bet even Google uses RDBMS' for their own HR and financial data, etc.
For some web applications, you might even want a combination of both, the nosql database for some types of information, the transactional relational database for orders and other things where transactional consistency is a must.
If you develop web sites, I think you need to thoroughly understand both types of databases and the needs behind them before choosing how to handle any new functionality.
It seems to me that you have almost no knowledge of relational databases and would rather do what is easier for you personally than what is right for the project. Maybe I'm not reading that correctly, but anyone who never uses joins is suspect in terms of understanding relational databases.
You don't decide between these two based on which one seems easier to understand or which is the buzzword of the month, you decide them based on the functionality you will need, not just for the user interface but for administrative tasks, reporting, financial or other types of data auditing, government regulation, data recovery in case of a hardware failure, etc.
RDBMS' are all about consistency. They do a great job on data that gets churned alot with transactions. See also ACID (atomicity, consistency, isolation, durability). Sometimes you don't need all that, like when storing data from logs or working on data that's not going to change, just accumulate.
NoSQL databases let you relax the requirements for transactions and get better performance (as well as scale to large distributed storage silos easier).
The advantage fo NoSql is that its simpler and if you have your OO blinkers on it fullfills all your persistence needs.
The advantage of SQL based realtional database is that you can easily re-use and extend your data in ways that were not envisaged in the original design. Also "Object" databases tend to perform very badly (even if its possable) when you want to do the equivalent of SQLs aggregate queries like COUNT, SUM, AVG.
Googles BIGTABLE which is the biggest OO database anywhere (and probably the biggest database period) also supports SQL and sql features like indexing and strong typing.
Answer is easy. If you need data storage - use NoSQL, if you need more features then just storing data - use RDBMS.
I guess what I'm asking is "when should I choose NoSQL over RDBMS?"
[Caveat: I've never read about NoSQL before]
According to Wikipedia, NoSQL isn't good at joins: which implies (to me) no referential integrity and no normalization.
As many books about NoSQL mention, it's not about which database is better than the other. It's more what you need.
As everyone say in the other answers, many NoSQL databases support horizontal scalability and are focused on high availability but they are not always the best fit for your needs.
for example, Cassandra is great to add or remove nodes from a cluster, allowing that high scalability. But when you compare Cassandra with MySQL in an environment with just one node (one server), and with no distributed architecture, there isn't a lot of different, since the main advantages of Cassandra are not used.
Now, why should you use SQL? The most common reason is transaction management. Currently, no popular NoSQL database natively supports transactions. You can emulate them, but they are not part of the native functionality as in most SQL databases.
For Cassandra, there is a full and free training in https://academy.datastax.com
There you won't only find trainings to install and configure Cassandra, but to use its tools. It even gives you completion certificates.
Datastax has its own distribution of Cassandra, but it follows all the same guidelines as the Apache project; it offers some extra tools.
The simplest answer I can think of is: When your data doesn't fit a relational model.
I gave a talk at OSCON about when NoSQL can be the right choice, and some of the different sub-categories to be aware of: http://assets.en.oreilly.com/1/event/45/The%20NoSQL%20Ecosystem%20Presentation.pdf
Cassandra in and of itself is not better than an RDBMS. It is better under some circumstances. An RDBMS is vastly superior for transaction processing, master data management, reference data, data warehousing and (some forms of) BI.
Use NOSQL if your application requires a flexible schema, variable length rows, variable types of columns, eventual integrity, horizonal scalability on commodity servers, and high availability achieved by means of a distributed architecture.
NOSQL does not do joins for several reasons: you already joined the data before the NOSQL file was loaded so there is no need to; because a distributed join over far-reaching servers would be resource intensive. The first reason above is simple: you have embedded all the data you need into a single structure. If you do not embed the data and have to link, don't expect great performance out of it. Linking is a euphemism for application-provided joining without the benefit of consolidating the data as a join does. Assuming hashing a key is the method of data distribution, different records that have the same hash key would be collocated. Thereby if joining were permitted, the joined data would all be on the same server.
It's not just black and white.

How to create a multi-tenant database with shared table structures?

Our software currently runs on MySQL. The data of all tenants is stored in the same schema. Since we are using Ruby on Rails we can easily determine which data belongs to which tenant. However there are some companies of course who fear that their data might be compromised, so we are evaluating other solutions.
So far I have seen three options:
Multi-Database (each tenant gets its own - nearly the same as 1 server per customer)
Multi-Schema (not available in MySQL, each tenant gets its own schema in a shared database)
Shared Schema (our current approach, maybe with additional identifying record on each column)
Multi-Schema is my favourite (considering costs). However creating a new account and doing migrations seems to be quite painful, because I would have to iterate over all schemas and change their tables/columns/definitions.
Q: Multi-Schema seems to be designed to have slightly different tables for each tenant - I don't want this. Is there any RDBMS which allows me to use a multi-schema multi-tenant solution, where the table structure is shared between all tenants?
P.S. By multi I mean something like ultra-multi (10.000+ tenants).
However there are some companies of
course who fear that their data might
be compromised, so we are evaluating
other solutions.
This is unfortunate, as customers sometimes suffer from a misconception that only physical isolation can offer enough security.
There is an interesting MSDN article, titled Multi-Tenant Data Architecture, which you may want to check. This is how the authors addressed the misconception towards the shared approach:
A common misconception holds that
only physical isolation can provide an
appropriate level of security. In
fact, data stored using a shared
approach can also provide strong data
safety, but requires the use of more
sophisticated design patterns.
As for technical and business considerations, the article makes a brief analysis on where a certain approach might be more appropriate than another:
The number, nature, and needs of the
tenants you expect to serve all affect
your data architecture decision in
different ways. Some of the following
questions may bias you toward a more
isolated approach, while others may
bias you toward a more shared
approach.
How many prospective tenants do you expect to target? You may be nowhere
near being able to estimate
prospective use with authority, but
think in terms of orders of magnitude:
are you building an application for
hundreds of tenants? Thousands? Tens
of thousands? More? The larger you
expect your tenant base to be, the
more likely you will want to consider
a more shared approach.
How much storage space do you expect the average tenant's data to occupy?
If you expect some or all tenants to
store very large amounts of data, the
separate-database approach is probably
best. (Indeed, data storage
requirements may force you to adopt a
separate-database model anyway. If so,
it will be much easier to design the
application that way from the
beginning than to move to a
separate-database approach later on.)
How many concurrent end users do you expect the average tenant to support?
The larger the number, the more
appropriate a more isolated approach
will be to meet end-user requirements.
Do you expect to offer any per-tenant value-added services, such
as per-tenant backup and restore
capability? Such services are easier
to offer through a more isolated
approach.
UPDATE: Further to update about the expected number of tenants.
That expected number of tenants (10k) should exclude the multi-database approach, for most, if not all scenarios. I don't think you'll fancy the idea of maintaining 10,000 database instances, and having to create hundreds of new ones every day.
From that parameter alone, it looks like the shared-database, single-schema approach is the most suitable. The fact that you'll be storing just about 50Mb per tenant, and that there will be no per-tenant add-ons, makes this approach even more appropriate.
The MSDN article cited above mentions three security patterns that tackle security considerations for the shared-database approach:
Trusted Database Connections
Tenant View Filter
Tenant Data Encryption
When you are confident with your application's data safety measures, you would be able to offer your clients a Service Level Agrement that provides strong data safety guarantees. In your SLA, apart from the guarantees, you could also describe the measures that you would be taking to ensure that data is not compromised.
UPDATE 2: Apparently the Microsoft guys moved / made a new article regarding this subject, the original link is gone and this is the new one: Multi-tenant SaaS database tenancy patterns (kudos to Shai Kerer)
Below is a link to a white-paper on Salesforce.com about how they implement multi-tenancy:
http://www.developerforce.com/media/ForcedotcomBookLibrary/Force.com_Multitenancy_WP_101508.pdf
They have 1 huge table w/ 500 string columns (Value0, Value1, ... Value500). Dates and Numbers are stored as strings in a format such that they can be converted to their native types at the database level. There are meta data tables that define the shape of the data model which can be unique per tenant. There are additional tables for indexing, relationships, unique values etc.
Why the hassle?
Each tenant can customize their own data schema at run-time without having to make changes at the database level (alter table etc). This is definitely the hard way to do something like this but is very flexible.
My experience (albeit SQL Server) is that multi-database is the way to go, where each client has their own database. So although I have no mySQL or Ruby On Rails experience, I'm hoping my input might add some value.
The reasons why include :
data security/disaster recovery. Each companies data is stored entirely separately from others giving reduced risk of data being compromised (thinking things like if you introduce a code bug that means something mistakenly looks at other client data when it shouldn't), minimizes potential loss to one client if one particular database gets corrupted etc. The perceived security benefits to the client are even greater (added bonus side effect!)
scalability. Essentially you'd be partitioning your data out to enable greater scalability - e.g. databases can be put on to different disks, you could bring multiple database servers online and move databases around easier to spread the load.
performance tuning. Suppose you have one very large client and one very small. Usage patterns, data volumes etc. can vary wildly. You can tune/optimise easier for each client should you need to.
I hope this does offer some useful input! There are more reasons, but my mind went blank. If it kicks back in, I'll update :)
EDIT:
Since I posted this answer, it's now clear that we're talking 10,000+ tenants. My experience is in hundreds of large scale databases - I don't think 10,000 separate databases is going to be too manageable for your scenario, so I'm now not favouring the multi-db approach for your scenario. Especially as it's now clear you're talking small data volumes for each tenant!
Keeping my answer here as anyway as it may have some use for other people in a similar boat (with fewer tenants)
As you mention the one database per tenant is an option and does have some larger trade-offs with it. It can work well at smaller scale such as a single digit or low 10's of tenants, but beyond that it becomes harder to manage. Both just the migrations but also just in keeping the databases up and running.
The per schema model isn't only useful for unique schemas for each, though still running migrations across all tenants becomes difficult and at 1000's of schemas Postgres can start to have troubles.
A more scalable approach is absolutely having tenants randomly distributed, stored in the same database, but across different logical shards (or tables). Depending on your language there are a number of libraries that can help with this. If you're using Rails there is a library to enfore the tenancy acts_as_tenant, it helps ensure your tenant queries only pull back that data. There's also a gem apartment - though it uses the schema model it does help with the migrations across all schemas. If you're using Django there's a number but one of the more popular ones seems to be across schemas. All of these help more at the application level. If you're looking for something more at the database level directly, Citus focuses on making this type of sharding for multi-tenancy work more out of the box with Postgres.