According to https://www.mongodb.com/nosql-explained:
NoSQL database models are either key-value pairs, document-based, graph databases, or wide-column stores.
But according to "What is NoSQL, how does it work, and what benefits does it provide?":
"NoSQL" is basically:
"a generic word for a variety of new data storage backends that do not
follow the relational DB model."
This confuses me because there are many more database models than just the four listed in the MongoDB source, such as the star schema, the network model, etc. (see https://en.wikipedia.org/wiki/Database_model).
What am I missing?
I did some additional research and I'm not sure if I got it right, but here's my attempt.
A NoSQL Database can be implemented with any non-relational Database Model.
The "NoSQL Database Types": "key-value pairs, document-based, graph databases, or wide-column stores" does not refer to a Database Model. It is referring to the particular data structures used.
Example: MongoDB is one of many NoSQL databases. It is implemented with a semi-structured database model and is a document-based type of NoSQL database (my interpretation of the first line of https://en.wikipedia.org/wiki/MongoDB).
I am wondering whether cubes or tabular models have any advantages over star schemas other than MDX/DAX query speed. Any feedback would be very much appreciated. Thanks.
Christian
When you say "advantages over star schemas", I am assuming that you mean a Star schema in a relational database? The primary difference is the potentially orders of magnatitude difference in speed, but in the area of self-service BI, a bigger advantage of Cubes or Models is that they implement an entirely new semantic layer. They give you the opportunity to rename fields that may have obscure names in the DB, to have more useful recognisable names for the business users and hide more technical fields, that are not useful to end users. You can define reuseable Named Sets and Hierarchies that enable easier, more effective and consistent reporting.
But the two biggies for me are the speed and the business-user-friendly semantic layer. JK.
New to DW concepts and SSAS. I'm reading a lot that normalized relational DBs are optimal for OLTP due to a typical workload of many small, single-transaction batches, and that denormalization is generally better for DW/BI applications because the queries used for reporting are more batch-based... there were other reasons that I don't recall right now.
It sounds like the advice is to create a denormalized model, populate it from the base relational model, and then build your cubes off the denormalized model. Assuming you're using the MOLAP storage mode, your cube will store and incrementally update your data in a multidimensional model that it builds behind the scenes.
So now we have essentially the same data stored three times!
Am I reading that right? Why do we even need that intermediate denormalized table? It can't be to optimize report queries, because those are being run against the multidimensional SSAS data store. Why not just build your cubes against a DSV (data source view) whose definition is basically a view of the relational DB?
The multidimensional model needs the relational model to be available as star schemas (that is what you call the "denormalized model") for loading the data. And in many cases there is some processing involved: combining data from different sources, keeping the data for reporting longer than it is needed in the OLTP world, or keeping historical views (like old regional or department structures) available for analysis, which are not needed in the OLTP world and hence get overwritten there. Hence, this intermediate step makes sense in many cases. You might also want clear cut-off times, i.e. always reporting data for complete days (or, in some cases, months), rather than having some data for the last day available and some not. That makes comparing numbers for a day easier than, e.g., comparing the sales of today, containing only the data up to 10 o'clock, with the sales of the whole day yesterday.
In some simple cases, the intermediate relational data structure need not exist physically. A few days ago I prepared a prototype cube where the star schema was just a set of views on the source data. In that case, of course, the data was only physically available in the original source form and in the cube. The structure of the source data did not make the views particularly inefficient, and thus loading the data into the cube was fast enough for the prototype.
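As a rough illustration of that view-based approach (all table and column names below are hypothetical), the "star schema" can be nothing more than a handful of views over the OLTP tables:

    -- A minimal sketch of a star schema built purely as views over
    -- hypothetical OLTP tables (orders, customers, products).
    CREATE VIEW DimCustomer AS
    SELECT customer_id, customer_name, region
    FROM customers;

    CREATE VIEW DimProduct AS
    SELECT product_id, product_name, category
    FROM products;

    CREATE VIEW FactSales AS
    SELECT o.order_date, o.customer_id, o.product_id,
           o.quantity, o.quantity * o.unit_price AS sales_amount
    FROM orders o;

The cube's data source view then points at these views instead of at physical star-schema tables; whether that is fast enough depends on how far the source structure is from a star shape and on how much history and cleansing you need.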
Can someone explain to me the advantages and disadvantages of a relational database such as MySQL compared to a graph database such as Neo4j?
In SQL you have multiple tables with various IDs linking them, and then you have to join to connect the tables. From the perspective of a newbie, why would you design the database to require a join rather than having the connections explicit as edges from the start, as with a graph database? Conceptually it makes no sense to a newbie. Presumably there is a very technical, but non-conceptual, reason for this?
There actually is conceptual reasoning behind both styles. Wikipedia on the relational model and graph databases gives good overviews of this.
The primary difference is that in a graph database, the relationships are stored at the individual record level, while in a relational database, the structure is defined at a higher level (the table definitions).
This has important ramifications:
A relational database is much faster when operating on huge numbers of records. In a graph database, each record has to be examined individually during a query in order to determine the structure of the data, while this is known ahead of time in a relational database.
Relational databases use less storage space, because they don't have to store all of those relationships.
Storing all of the relationships at the individual-record level only makes sense if there is going to be a lot of variation in the relationships; otherwise you are just duplicating the same things over and over. This means that graph databases are well-suited to irregular, complex structures. But in the real world, most databases require regular, relatively simple structures. This is why relational databases predominate.
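A rough way to picture that difference in SQL terms (the table names here are hypothetical): in a relational schema the relationship is declared once, at the table level, whereas a graph-style edge table stores one row per individual relationship:

    -- Relational style: the relationship is part of the schema.
    -- Every order relates to a customer in exactly the same way.
    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT NOT NULL REFERENCES customers (customer_id)
    );

    -- Graph style (emulated): each relationship is its own record,
    -- so every row can carry its own relationship type and direction.
    CREATE TABLE edges (
        from_node INT,
        to_node   INT,
        rel_type  VARCHAR(50)    -- e.g. 'PLACED', 'FRIEND', 'MANAGES'
    );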
The key difference between a graph and a relational database is that relational databases work with sets while graph databases work with paths.
This manifests itself in unexpected and unhelpful ways for an RDBMS user. For example, when trying to emulate path operations (e.g. friends of friends) by recursively joining in a relational database, query latency grows unpredictably and massively, as does memory usage, not to mention that it tortures SQL to express those kinds of operations. More data means slower in a set-based database, even if you can delay the pain through judicious indexing.
As Dan1111 hinted at, most graph databases don't suffer this kind of join pain because they express relationships at a fundamental level. That is, relationships physically exist on disk and they are named, directed, and can themselves be decorated with properties (this is called the property graph model, see: https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model). This means that, if you choose to, you can look at the relationships on disk and see how they "join" entities. Relationships are therefore first-class entities in a graph database and are semantically far stronger than those implied relationships reified at runtime in a relational store.
So why should you care? For two reasons:
Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb.
Graph databases make modelling and querying much more pleasant, meaning faster development and fewer WTF moments. For example, expressing friend-of-friend for a typical social network in Neo4j's Cypher query language is just MATCH (me)-[:FRIEND]->()-[:FRIEND]->(foaf) RETURN foaf.
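For comparison, a rough SQL equivalent of that friend-of-friend query, assuming a hypothetical friendship(person_id, friend_id) table; each additional level of "friends of" would add yet another join:

    -- Friends of friends for person 42, assuming a hypothetical
    -- friendship(person_id, friend_id) edge table.
    SELECT DISTINCT f2.friend_id AS foaf
    FROM friendship f1
    JOIN friendship f2 ON f2.person_id = f1.friend_id
    WHERE f1.person_id = 42
      AND f2.friend_id <> 42;   -- exclude the starting person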
Dan1111 has already given an answer flagged as correct. A couple of additional points are worth noting in passing.
First, in almost every implementation of graph databases, the records are "pinned" because there are an unknown number of pointers pointing at the record in its current location. This means that a record cannot be shuffled to a new location without either leaving a forwarding address at the old location or breaking an unknown number of pointers.
Theoretically, one could shuffle all the records at once and figure out a way to locate and repair all the pointers. In practice this is an operation that could take weeks on a large graph database, during which time the database would have to be off the air. It's just not feasible.
By contrast, in a relational database, records can be reshuffled on a fairly large scale, and the only thing that has to be done is to rebuild any indexes that have been affected. This is a fairly large operation, but nowhere near as large as the equivalent for a graph database.
The second point worth noting in passing is that the world wide web can be seen as a gigantic graph database. Web pages contain hyperlinks, and hyperlinks reference, among other things, other web pages. The reference is via URLs, which function like pointers.
When a web page is moved to a different URL without leaving a forwarding address at the old URL, an unknown number of hyperlinks will become broken. Those broken links then give rise to the dreaded "Error 404: page not found" message that interrupts the pleasure of so many surfers.
With a relational database we can model and query a graph by using foreign keys and self-joins. Just because RDBMSs contain the word "relational" does not mean that they are good at handling relationships. The word relational in RDBMS stems from relational algebra, not from relationships. In an RDBMS, the relationship itself does not exist as an object in its own right; it either needs to be represented explicitly as a foreign key or implicitly as a value in a link table (when using a generic/universal modelling approach). Links between data sets are stored in the data itself.
The more we increase the search depth in a relational database, the more self-joins we need to perform and the more our query performance suffers. The deeper we go in our hierarchy, the more tables we need to join and the slower our query gets. Mathematically, the cost grows exponentially in a relational database. In other words, the more complex our queries and relationships get, the more we benefit from a graph versus a relational database. We don't have performance problems in a graph database when navigating the graph, because a graph database stores the relationships as separate objects. However, the superior read performance comes at the cost of slower writes.
In certain situations it is easier to change the data model in a graph database than it is in an RDBMS; e.g. in an RDBMS, if I change a table relationship from 1:n to m:n, I need to apply DDL, with potential downtime.
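A rough sketch of what that 1:n to m:n change involves on the relational side (the employee/department tables are hypothetical): the foreign key has to be replaced by a separate link table and the existing data migrated:

    -- Before: 1:n, each employee belongs to exactly one department,
    -- via employee.department_id REFERENCES department.

    -- After: m:n requires a new link table plus a data migration.
    CREATE TABLE employee_department (
        employee_id   INT REFERENCES employee (employee_id),
        department_id INT REFERENCES department (department_id),
        PRIMARY KEY (employee_id, department_id)
    );

    INSERT INTO employee_department (employee_id, department_id)
    SELECT employee_id, department_id FROM employee;

    ALTER TABLE employee DROP COLUMN department_id;

Every query that joined through the old foreign key also has to be rewritten, whereas in a graph database you would typically just start adding additional relationship edges.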
An RDBMS, on the other hand, has advantages in other areas, e.g. aggregating data or doing timestamped version control on data.
I discuss some of the other pros and cons in my blog post on graph databases for data warehousing.
While the relational model can easily represent the data that is contained in a graph model, we face two significant problems in practice:
SQL lacks the syntax to easily perform graph traversal, especially traversals where the depth is unknown or unbounded. For instance, using SQL to determine friends of your friends is easy enough, but it is hard to solve the "degrees of separation" problem.
Performance degrades quickly as we traverse the graph. Each level of traversal adds significantly to query response time.
Reference: Next Generation Databases
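SQL can express the unbounded-depth case with a recursive common table expression, but it stays verbose and its cost still grows with every level traversed. A minimal sketch, again assuming a hypothetical friendship(person_id, friend_id) table:

    -- Degrees of separation from person 42, via a recursive CTE
    -- over a hypothetical friendship(person_id, friend_id) table.
    WITH RECURSIVE reachable (friend_id, depth) AS (
        SELECT friend_id, 1
        FROM friendship
        WHERE person_id = 42
        UNION
        SELECT f.friend_id, r.depth + 1
        FROM reachable r
        JOIN friendship f ON f.person_id = r.friend_id
        WHERE r.depth < 6            -- cap the search depth
    )
    SELECT friend_id, MIN(depth) AS degrees_of_separation
    FROM reachable
    GROUP BY friend_id;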
Graph databases are worth investigating for the use cases that they excel in, but I have had some reason to question some assertions in the responses above. In particular:
A relational database is much faster when operating on huge numbers of records (dan1111's first bullet point)
Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb. (Jim Webber's first bullet point)
In other words, the more complex our queries and relationships get, the more we benefit from a graph versus a relational database. (Uli Bethke's 2nd paragraph)
While these assertions may well have merit, I have yet to find a way to get my specific use case to align with them.
Reference: Graph Database or Relational Database Common Table Extensions: Comparing acyclic graph query performance
Relational Databases are much more efficient in storing tabular data. Despite the word “relational” in their name, relational databases are much less effective at storing or expressing relationships between stored data elements.
The term 'relational' in relational databases refers more to relating columns within a table than to relating information in different tables. Relationships between columns exist to support set operations. So as a database grows to millions or billions of records, it becomes extremely slow to retrieve data from a relational database.
Unlike a relational database, a graph database is structured entirely around data relationships. Graph databases treat relationships not as a schema structure but as data, like other values.
It is very fast to retrieve data from graph databases.
From a relational database standpoint, you could think of this as pre-materializing JOINs once at insertion time instead of computing them for every query. Because the data is structured entirely around data relationships, real-time query performance can be achieved no matter how large or connected the dataset gets.
Graph databases take more storage space compared to relational databases.
I'm starting to study SQL Server Analysis Services and I'm working my way through the training book, as well as the Developer Training Kit. In both, I find suggestions that the number of tables used in an OLAP database (ideally, star schema) is greatly reduced from the production OLTP database.
From the training kit:
We followed the data dimensional methodology to architect the data mart schema. From some 200 tables in the operational database, the data mart schema contained about 10 dimension tables and 2 fact tables.
From what I understand, the operational databases are usually (somewhat) normalised and the data mart schemas are heavily denormalised. I also believe that denormalising data usually involves adding more tables, not fewer.
I can't see how you can go from 200 tables to 12, unless you only need to report on a subset of data. And if you do only need to report on a subset of data, why can't you just use the appropriate tables in the operational database (unless there are significant performance gains to be made by using a denormalised star schema)?
Denormalizing is exactly the opposite of normalizing a database. In a normalized database everything is split apart into different tables to support concurrent writes to the data. This also has the side effect of storing any given piece of data exactly once (in an ideal third-normal-form data structure). A drawback of normalizing is that reads take a lot longer, because the data is scattered and we need to join tables to make sense of it again (joins are pretty expensive operations).
When we denormalize, we take the data from multiple tables and merge it into one table, so now we have repeating data in that table. The repeating data is useful because we no longer have to join to any other table to get it. Writing to this data store is normally a bad idea, because a single logical change can mean a lot of writes to update all of the repeated data in the table, whereas it would only take one write in a normalized database.
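A minimal sketch of that merge, with hypothetical orders/customers/products tables; the customer and product attributes get repeated on every sales row:

    -- Denormalizing hypothetical OLTP tables into one wide table:
    -- customer and product attributes are repeated on every row.
    CREATE TABLE sales_denormalized AS
    SELECT o.order_id,
           o.order_date,
           c.customer_name,
           c.region,
           p.product_name,
           p.category,
           o.quantity,
           o.quantity * o.unit_price AS sales_amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id;

Reports can now read from this one table without any joins, but renaming a customer means updating every one of their rows here instead of a single row in the customers table.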
OLTP stands for Online Transactional Processing; notice the word Transactional. Transactions are write operations, and the OLTP model is optimized for this. OLAP stands for Online Analytical Processing, Analysis being the keyword, meaning lots of reads.
Going from 200 tables to 12 in an OLTP-to-OLAP process will, surprisingly, hold nearly all of the data in the OLTP database plus more. The OLTP database is unable to record all of the changes to the data over time, but OLAP specializes in this, so you get all of your historical data as well as current data.
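One common way that history is captured on the OLAP side is a slowly changing dimension. A minimal Type 2 sketch, using a hypothetical customer dimension: instead of overwriting a customer's region, a new row is added and the old row is closed off, so old regional structures stay available for analysis:

    -- Hypothetical Type 2 slowly changing dimension: history is kept
    -- by versioning rows rather than overwriting them.
    CREATE TABLE DimCustomer (
        customer_key  INT PRIMARY KEY,    -- surrogate key
        customer_id   INT,                -- business key from the OLTP system
        customer_name VARCHAR(100),
        region        VARCHAR(50),
        valid_from    DATE,
        valid_to      DATE,               -- NULL = current version
        is_current    CHAR(1)
    );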
The star schema is probably the most common for OLAP data stores, the snowflake schema is also pretty common. You should learn about both and how to properly use them. It's just another great tool in your arsenal.
These two books from IBM will answer your questions much more thoroughly, and they are free PDFs.
http://www.redbooks.ibm.com/abstracts/sg247138.html
http://www.redbooks.ibm.com/abstracts/sg242238.html