How does one come to the conclusion that the data is "Unrelational"?

What sort of pointers, facets, characteristics, or structures should one analyse to determine whether a given collection of objects fits best in a relational store or a non-relational one? And further, how does one determine whether to use a document store over a key-value (KVP) store?

Related

How is a graph database different to a graph represented in a relational database?

I can represent a graph trivially in a relational database with two tables: vertex and edge. Richer structure like "properties" and "labels" (in Neo4j terminology) can be represented as more tables. Have I misunderstood, or does a graph database like Neo4j allow me to represent anything that is not easily representable relationally?
I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary. Have I misunderstood, or does a graph query language like Cypher provide greater expressivity than SQL?
The relational model of a graph is stored and queried efficiently, AFAIK. Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?
My relational database provides ACID guarantees, and allows me to write fairly expressive constraints on my graph data (and even more constraints if I break out the single vertex table into a properly normalized schema). Have I misunderstood, or does a graph database provide some guarantees or verify some kind of correctness properties that are not available in my relational database?
I am struggling to see how a graph database such as Neo4j is anything but a subset of the relational model. (Apologies for using Neo4j as representative of all graph databases here; it's the only one I've looked at.)
In short: Is graph database ⊆ relational database?
Is One a Subset of the Other?
Definitely not; both are ultimately modeled on the mathematical concepts of relations or graphs. Both models are so general that there is basically no information content you can't represent using either one. This means that while they differ in many syntactic-sugar ways, and in the way they encourage you to model and think of data (just as programming languages differ), they both have the same "expressive power".
What you describe in your question is one way of modeling a graph (vertex and edge tables). That implementation of a graph is a subset of what relational can express. Similarly, I could mock up tables and rows using a graph database, but I would have chosen a particular implementation - this wouldn't demonstrate that relational data is a subset of graph data.
So the first insight is that they have roughly equal expressive power. You can model anything in either. So the real question you should be asking is why would you choose one over the other?
Why Would You Choose One Over the Other?
All databases exist to facilitate data access. Simply put, you store it so that you can get at the data. But exactly how do you need to get at the data? There are many different access patterns. The design space for databases in general is enormous. Any time a database makes a certain decision, that tends to automatically make it better at some things, worse at others. For example, when you create an index in a relational database, you've just sped up reads -- but you've degraded the performance of writes, because the index has to be maintained.
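For instance (a trivial illustration in SQL; the table and index names are made up):
CREATE INDEX idx_person_name ON person (name);
-- Lookups by name now get faster, but every INSERT or UPDATE on person
-- must also maintain idx_person_name, so writes get slower.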
So, when approaching the question "graph or relational?", you should first figure out what your data looks like and what your data access patterns look like. Once you know those things, you can evaluate a bunch of databases, see the choices they've made, and pick the one that's a good fit for what you need. And if a DBMS made a choice that would make certain access patterns difficult, buggy, or slow, you can avoid that DBMS for that data set.
It's (Partly) About Data Access Patterns
Graph databases tend to be better than relational when the data being stored is a graph, when the data access pattern involves a lot of graph traversal, or both. (See this other answer I wrote for a more in-depth discussion of why this is). That link there also provides the answer to your specific question: "Does a graph database structure its storage, or optimize its queries, in some way that provides performance characteristics that cannot be gained from a relational database?"
You say: "I can query this graph using SQL, with recursive subqueries if necessary, and with multiple separate queries in a transaction if necessary." Technically this is true, but let's take an example to see why relational might not be good enough. Say I have a graph (in an RDBMS: a table of nodes and a table of edges, with a join key between them). Let's say I pick out one node, and I want to identify everything that is between 6 and 8 hops away from that node. Here's the Cypher to do that:
match (myChosenNode {id: 'foo'})-[r:relationshipType*6..8]->(y) return y;
I really want to see you write that up as SQL. It's possible, but it's hard and complicated. And it will also perform like a dog, because of the sheer quantity of joining you'll be doing on non-trivial quantities of data.
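For comparison, here is roughly what that looks like as a recursive CTE in SQL. This is only a sketch under assumptions: it presumes hypothetical node(id) and edge(src, dst) tables, and it makes no attempt to avoid revisiting nodes on cyclic graphs:
WITH RECURSIVE hops (node_id, depth) AS (
    SELECT e.dst, 1                      -- start from the chosen node
      FROM edge e
     WHERE e.src = 'foo'
    UNION ALL
    SELECT e.dst, h.depth + 1            -- each level is another join on edge
      FROM hops h
      JOIN edge e ON e.src = h.node_id
     WHERE h.depth < 8
)
SELECT DISTINCT node_id
  FROM hops
 WHERE depth BETWEEN 6 AND 8;
Every level of the recursion is another join against the edge table, which is exactly where the cost piles up.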
ACID
As for the ACID guarantees: Neo4j provides transactions with full ACID guarantees. The answer will be different for different graph databases though, particularly the ones implemented on top of Hadoop/HBase. YMMV there, so check the fine print for each database.
It is true that there are a number of RDBMS features that you typically won't find in graph databases, examples being triggers and certain kinds of constraints. As a long-time RDBMS nerd myself, I'm not so happy about those things being missing; I think they are valuable.
Summary
What this mostly boils down to, for me and many other engineers I work with, is:
What is your data?
What are your access patterns?
If your data is a graph, or your access patterns involve a lot of graph traversal, you should probably use a graph DB. If your data is more tabular, or your access patterns are more oriented around bulk scans, then you should use an RDBMS. At the end of the day, they're two different tools with different niches. If you use them in their areas of strength, you'll be happy. If you use an RDBMS to model a graph just "because you can", you'll suffer. If you use a graph database to do a lot of bulk scans of every node in every graph, you'll suffer. Like most of tech, it's about using the right tool for the job.

Comparison of Relational Databases and Graph Databases

Can someone explain to me the advantages and disadvantages of a relational database such as MySQL compared to a graph database such as Neo4j?
In SQL you have multiple tables with various ids linking them. Then you have to join to connect the tables. From the perspective of a newbie, why would you design the database to require a join rather than having the connections explicit as edges from the start, as with a graph database? Conceptually it makes no sense to a newbie. Presumably there is a very technical but non-conceptual reason for this?
There actually is conceptual reasoning behind both styles. The Wikipedia articles on the relational model and graph databases give good overviews of this.
The primary difference is that in a graph database, the relationships are stored at the individual record level, while in a relational database, the structure is defined at a higher level (the table definitions).
This has important ramifications:
A relational database is much faster when operating on huge numbers of records. In a graph database, each record has to be examined individually during a query in order to determine the structure of the data, while this is known ahead of time in a relational database.
Relational databases use less storage space, because they don't have to store all of those relationships.
Storing all of the relationships at the individual-record level only makes sense if there is going to be a lot of variation in the relationships; otherwise you are just duplicating the same things over and over. This means that graph databases are well-suited to irregular, complex structures. But in the real world, most databases require regular, relatively simple structures. This is why relational databases predominate.
The key difference between a graph and a relational database is that relational databases work with sets while graph databases work with paths.
This manifests itself in unexpected and unhelpful ways for an RDBMS user. For example, when trying to emulate path operations (e.g. friends of friends) by recursively joining in a relational database, query latency grows unpredictably and massively, as does memory usage, not to mention that it tortures SQL to express those kinds of operations. More data means slower in a set-based database, even if you can delay the pain through judicious indexing.
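To make the join pain concrete, here is a hedged SQL sketch of friend-of-friend, assuming a hypothetical friendship(person_id, friend_id) table; every additional hop costs another self-join:
-- friends of friends of 'me' (direct friends not excluded, for brevity)
SELECT DISTINCT f2.friend_id AS foaf
  FROM friendship f1
  JOIN friendship f2 ON f2.person_id = f1.friend_id   -- one more hop, one more join
 WHERE f1.person_id = 'me'
   AND f2.friend_id <> 'me';
Compare this with the one-line Cypher version quoted a little further down.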
As Dan1111 hinted at, most graph databases don't suffer this kind of join pain because they express relationships at a fundamental level. That is, relationships physically exist on disk, and they are named, directed, and can themselves be decorated with properties (this is called the property graph model; see: https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model). This means that if you chose to, you could look at the relationships on disk and see how they "join" entities. Relationships are therefore first-class entities in a graph database and are semantically far stronger than those implied relationships reified at runtime in a relational store.
So why should you care? For two reasons:
Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb.
Graph databases make modelling and querying much more pleasant, meaning faster development and fewer WTF moments. For example, expressing friend-of-friend for a typical social network in Neo4j's Cypher query language is just MATCH (me)-[:FRIEND]->()-[:FRIEND]->(foaf) RETURN foaf.
Dan1111 has already given an answer flagged as correct. A couple of additional points are worth noting in passing.
First, in almost every implementation of graph databases, the records are "pinned" because there are an unknown number of pointers pointing at the record in its current location. This means that a record cannot be shuffled to a new location without either leaving a forwarding address at the old location or breaking an unknown number of pointers.
Theoretically, one could shuffle all the records at once and figure out a way to locate and repair all the pointers. In practice this is an operation that could take weeks on a large graph database, during which time the database would have to be off the air. It's just not feasible.
By contrast, in a relational database, records can be reshuffled on a fairly large scale, and the only thing that has to be done is to rebuild any indexes that have been affected. This is a fairly large operation, but nowhere near as large as the equivalent for a graph database.
The second point worth noting in passing is that the world wide web can be seen as a gigantic graph database. Web pages contain hyperlinks, and hyperlinks reference, among other things, other web pages. The reference is via URLs, which function like pointers.
When a web page is moved to a different URL without leaving a forwarding address at the old URL, an unknown number of hyperlinks will become broken. These broken links then give rise to the dreaded, "Error 404: page not found" message that interrupts the pleasure of so many surfers.
With a relational database we can model and query a graph by using foreign keys and self-joins. Just because RDBMSs contain the word "relational" does not mean that they are good at handling relationships. The word relational in RDBMS stems from relational algebra and not from relationship. In an RDBMS, the relationship itself does not exist as an object in its own right. It either needs to be represented explicitly as a foreign key or implicitly as a value in a link table (when using a generic/universal modelling approach). Links between data sets are stored in the data itself.
The more we increase the search depth in a relational database, the more self-joins we need to perform and the more our query performance suffers. The deeper we go in our hierarchy, the more tables we need to join and the slower our query gets. Mathematically, the cost grows exponentially with depth in a relational database. In other words, the more complex our queries and relationships get, the more we benefit from a graph versus a relational database. We don't have performance problems in a graph database when navigating the graph, because a graph database stores the relationships as separate objects. However, the superior read performance comes at the cost of slower writes.
In certain situations it is easier to change the data model in a graph database than it is in an RDBMS, e.g. in an RDBMS if I change a table relationship from 1:n to m:n I need to apply DDL with potential downtime.
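For instance, a hedged sketch of that 1:n-to-m:n migration in SQL (all table and column names hypothetical): the foreign key column has to be replaced by a link table, and the existing rows migrated:
-- before: each book references exactly one author (1:n)
-- after: books and authors are linked m:n through a join table
CREATE TABLE book_author (
    book_id   INT NOT NULL REFERENCES book(id),
    author_id INT NOT NULL REFERENCES author(id),
    PRIMARY KEY (book_id, author_id)
);
INSERT INTO book_author (book_id, author_id)
    SELECT id, author_id FROM book;        -- carry over the existing links
ALTER TABLE book DROP COLUMN author_id;    -- locks the table on many RDBMSs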
An RDBMS, on the other hand, has advantages in other areas, e.g. aggregating data or doing timestamped version control on data.
I discuss some of the other pros and cons in my blog post on graph databases for data warehousing.
While the relational model can easily represent the data that is contained in a graph model, we face two significant problems in practice:
SQL lacks the syntax to easily perform graph traversal, especially traversals where the depth is unknown or unbounded. For instance, using SQL to determine friends of your friends is easy enough, but it is hard to solve the "degrees of separation" problem.
Performance degrades quickly as we traverse the graph. Each level of traversal adds significantly to query response time.
Reference: Next Generation Databases
Graph databases are worth investigating for the use cases that they excel in, but I have had some reason to question some assertions in the responses above. In particular:
A relational database is much faster when operating on huge numbers of records (dan1111's first bullet point)
Graph databases are much faster than relational databases for connected data - a strength of the underlying model. A consequence of this is that query latency in a graph database is proportional to how much of the graph you choose to explore in a query, and is not proportional to the amount of data stored, thus defusing the join bomb. (Jim Webber's first bullet point)
In other words, the more complex our queries and relationships get, the more we benefit from a graph versus a relational database. (Uli Bethke's 2nd paragraph)
While these assertions may well have merit, I have yet to find a way to get my specific use case to align with them.
Reference: Graph Database or Relational Database Common Table Extensions: Comparing acyclic graph query performance
Relational Databases are much more efficient in storing tabular data. Despite the word “relational” in their name, relational databases are much less effective at storing or expressing relationships between stored data elements.
The term 'relational' in relational databases relates more to relating columns within a table than to relating information in different tables. Relationships between columns exist to support set operations. So as a database grows to millions or billions of records, retrieving connected data from a relational database becomes extremely slow.
Unlike a relational database, a graph database is structured entirely around data relationships. Graph databases treat relationships not as a schema structure but as data, like other values.
Retrieving connected data from a graph database is therefore very fast.
From a relational database standpoint, you could think of this as pre-materializing JOINs once at insertion time instead of computing them for every query. Because the data is structured entirely around data relationships, real-time query performance can be achieved no matter how large or connected the dataset gets.
Graph databases do, however, take more storage space than relational databases.

Redis: how to represent a 2D world (MMO?)

I'm making an MMO which will have a 2d world. (This is a learning project so don't try to talk me out of it :) I'd like to experiment with a new data store for it, and I've read good things about redis. I've been through the tutorials, and I think I'm beginning to understand redis' strength, but it's not clear how I might model such a world.
Here are some design requirements
I need to be able to send down the current position of elements in the world. It would be acceptable to divide these into geographic "rooms" for performance.
I need to be able to "move" an object in the world, changing its position
I need to get information about an object: what's its name, type, attributes, etc.
I need to be able to calculate basic collisions (obj a wants to move right, is there a rock there?)
How would you model this in redis? Is redis an inappropriate choice?
Assuming the 2D world is split into squares of known coordinates, you would probably do best in redis to produce keys based on the coordinates which return Sets of IDs of objects (or exits/paths, or terrain, etc.)
e.g. a very simple illustration
obj:1:name = Rock
obj:1:passable = false
obj:2:name = Skeleton
obj:2:passable = true
loc:0:0:objs = {1,2} // loc:0:0 contains obj:1 and obj:2
loc:0:0:paths = {0:1, 1:0, 1:1} // three legal paths, to loc:0:1, loc:1:0, loc:1:1
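Building on that layout, a hedged sketch of a move with a collision check, using only standard redis commands (the key names follow the illustration above; a real implementation would wrap this in MULTI/EXEC or a Lua script to avoid races):
SMEMBERS loc:0:1:objs               // who is already in the target square?
GET obj:1:passable                  // "false" means the rock blocks the move
SMOVE loc:0:0:objs loc:0:1:objs 2   // move the skeleton from loc:0:0 to loc:0:1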
I'm not enough of a redis expert to know if this domain is going to cause you problems in redis, so take my advice with a grain of salt.
Redis is a key-value store, and is likely not very well suited to this task out of the box, but if I were going to take a stab at this, I would probably read up on R-trees (or another spatially oriented data structure) and then map such a data structure onto the key-value store pattern. For instance, you could represent your MMO world as an R-tree in Redis by making each node in the tree correspond to a particular key in Redis, which would then contain the bounding rect (and/or other leaf data) and a list of keys pointing to each of the child nodes.
This will at least allow you to keep your value sizes constrained (a property not necessarily shared by a flat, uniform grid representation), and keep the number of Redis lookups required for any given operation on the order of log(n).
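A hedged sketch of what that key layout might look like (all key names hypothetical), with each node key holding its bounding rectangle plus either child-node keys or leaf object IDs:
HMSET rtree:node:7 minx 0 miny 0 maxx 64 maxy 64        // bounding rect of this node
SADD rtree:node:7:children rtree:node:12 rtree:node:13  // internal node: child keys
SADD rtree:node:12:objs 1 2                             // leaf node: object IDs
A lookup walks from the root key downwards, descending only into children whose rectangles contain the query point.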
Have fun! Sounds like a fun way to learn.

Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I cannot see how to associate my own custom feature vector, with precomputed weights, with a document. Any recommendations would be much appreciated!
Thank you.
I have been doing some similar work and have discovered that redis' zset is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory mapped files).
Basically, a zset is a set of members ordered by an associated numeric score.
So you can have one sorted set per feature, where each feature maps to its postings:
feature -> [ { docid, score }, { docid, score }, ... ]
i.e.
zadd feature score docid
redis then has some nice operators to merge, extract ranges etc. See zunionstore, zrange (http://redis.io/commands/zunionstore).
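For example, a hedged sketch of scoring documents against a two-feature weighted query (the feature names and weights are made up; ZUNIONSTORE sums each document's per-feature scores, multiplied by the given weights):
zadd color 0.8 doc1
zadd color 0.3 doc2
zadd shape 0.5 doc1
zunionstore result 2 color shape WEIGHTS 2.0 1.0
zrevrange result 0 9 WITHSCORES    // top-10 docs by combined score
That is essentially a sparse dot product between the query weights and each document's feature vector.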
Very fast (supposedly) and all in memory etc ... (though redis is not an embedded db).
Have you looked at Terrier? I'm not quite sure it has in-memory indexes, but it is far more extensible regarding indexing and scoring than Lucene.
Lucene lets you store pretty much any data associated with a document. It also has a feature called "payloads" that allows you to store arbitrary data in the index associated with a term in a document. So I think what you want is to store your "features" as terms in the index and the weights as payloads, and you should be able to make Lucene do what you want. It also has an in-memory index implementation.
If the pairs of entities you want to compare are already given in advance, and you are interested in the pair-wise scores, I don't think Lucene will give you any advantage. Just lookup the vectors in some key-value store and compute the similarity. Consider using a sparse vector representation for space and time efficiency.
If only one entity is given in advance, and you are more interested in a ranking like scenario, Lucene may be worth a try.
The right place to look would be
org.apache.lucene.search.Similarity
you should be able to adapt it to your needs and set your version as default with
setDefault(Similarity similarity)
I would be careful with expectations of speed gains (w.r.t. iterating through all pairs), however, as they largely depend on the sparsity (of the query) and the scoring function you choose to implement. Also note that Lucene uses a two-stage retrieval scheme: first boolean ("are all of the AND terms contained? any of the OR terms?"), then scoring what passes. While for tf.idf you lose nothing on the way, for other scoring functions you might.
For more general approaches to efficient approximate nearest-neighbor search, it might be worthwhile to look into LSH:
http://en.wikipedia.org/wiki/Locality-sensitive_hashing

What are the best uses of document stores?

I have been hearing a lot about document-oriented data stores like CouchDB. I understand the uses of BigTable-like stores such as Cassandra. After reading this question, I was wondering what conditions would merit using a document store.
Column-family stores such as Bigtable and Cassandra have very limited querying capabilities. The application is responsible for maintaining indexes in order to query a more complex data model.
Document databases allow you to query the content, not just the key. They will also manage the indexes for you, reducing the complexity of your application.
Domain-driven design evangelizes the use of aggregates and value objects. As Ayende points out, (complex) aggregates are very natural candidates to be stored as a single document, instead of normalizing them over multiple tables or column families. This will reduce the complexity of your persistence layer. There's also less chance that related data is scattered across multiple nodes, as all the data is contained in a single document.
If your application needs to store polymorphic objects, document databases are also a good candidate. Of course, this could also be stored in Cassandra, but you won't have as much querying capabilities. At least not out of the box.
Think of a document database as a luxurious sports car. It doesn't need a professional driver (read: complex application) to get you from A to B, it has features such as air conditioning and comfortable seats and it will lap the high-scalability track in an acceptable time. However, if you want to set a lap record on the high-scalability track, you will need a professional driver and a highly optimized car (e.g. Cassandra), which lacks features such as air conditioning.
Another feature of CouchDB is that you can create those aggregations not as documents stored manually, but as views (which are derived from the stored data and updated automatically).
This is like power windows, heated seats, or the kicking stereo.
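To make that concrete, a hedged sketch of such a CouchDB view (the document fields are hypothetical; CouchDB views are defined as JavaScript map functions, optionally paired with a reduce):
// map: emit each order's total, keyed by customer
function (doc) {
  if (doc.type === 'order') {
    emit(doc.customer_id, doc.total);
  }
}
// reduce: CouchDB's built-in sum
_sum
Query the view and CouchDB returns the per-customer totals, kept up to date incrementally as documents change.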