Representing a DAG (directed acyclic graph) in SQL

I need to store dependencies in a DAG. (We're mapping a new school curriculum at a very fine-grained level.)
We're using Rails 3.
Considerations:
- Wider than it is deep.
- Very large: I estimate 5-10 links per node, and this will increase as the system grows.
- Many reads, few writes.
- The most common operations are lookups: dependencies of first and second degree, and searching/verifying dependencies.
I know SQL; I'll consider NoSQL.
I'm looking for pointers to good comparisons of the implementation options.
I'm also interested in what we can start with quickly but that will be less painful to transition away from to something more robust/scalable later.

I found this example of modeling a directed acyclic graph in SQL:
http://www.codeproject.com/KB/database/Modeling_DAGs_on_SQL_DBs.aspx?msg=3051183
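For concreteness, here is a minimal sketch of one common plain-SQL approach (not necessarily the article's): a single adjacency-list table plus a depth-limited recursive CTE covering the first- and second-degree lookups mentioned above. The table and column names are invented, and it assumes a database with recursive CTE support (SQLite via JDBC here, purely for illustration):

    import java.sql.*;

    // Sketch only: one adjacency-list table, queried with a depth-limited
    // recursive CTE. "edges", "prereq" and "node" are invented names.
    public class DagLookup {
        public static void main(String[] args) throws SQLException {
            try (Connection db = DriverManager.getConnection("jdbc:sqlite::memory:")) {
                Statement s = db.createStatement();
                s.execute("CREATE TABLE edges (prereq INTEGER NOT NULL, node INTEGER NOT NULL, "
                        + "PRIMARY KEY (prereq, node))");
                // 1 -> 2, 2 -> 3, 2 -> 4, 3 -> 5 (prereq -> dependent)
                s.execute("INSERT INTO edges VALUES (1,2),(2,3),(2,4),(3,5)");

                // Dependencies of node 5, up to two hops (first and second degree).
                PreparedStatement q = db.prepareStatement(
                    "WITH RECURSIVE deps(id, depth) AS ("
                  + "   SELECT prereq, 1 FROM edges WHERE node = ?"
                  + "   UNION ALL"
                  + "   SELECT e.prereq, d.depth + 1 FROM edges e JOIN deps d ON e.node = d.id"
                  + "   WHERE d.depth < 2)"
                  + " SELECT DISTINCT id, depth FROM deps");
                q.setInt(1, 5);
                try (ResultSet rs = q.executeQuery()) {
                    while (rs.next()) {
                        System.out.println("prerequisite " + rs.getInt("id")
                                + " at depth " + rs.getInt("depth"));
                    }
                }
            }
        }
    }

Verifying whether one node depends on another is the same CTE without the depth cap, checking whether the candidate id shows up; an acyclicity check before inserting an edge (reject it if the target can already reach the source) can reuse the same query.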

I think the upcoming version (beta at the moment) of the Ruby bindings for the graph database Neo4j should be a good fit. It's for use with Rails 3. The underlying data model uses nodes and directed relationships/edges with key/value style attributes on both. To scale read-mostly architectures Neo4j uses a master/slave replication setup.

You could use OrientDB as the graph database. It's highly optimized for relationships, since they are stored as direct links rather than resolved through JOINs; loading a bidirectional graph with 1,000 vertices takes only a few milliseconds.
The language binding for Rails is not yet available, but you can use it via HTTP RESTful calls.

You might want to take a look at the acts-as-dag gem:
https://github.com/resgraph/acts-as-dag
There is also some good writing on DAGs in SQL for people who need background on this:
http://www.codeproject.com/Articles/22824/A-Model-to-Represent-Directed-Acyclic-Graphs-DAG-o

Related

Why aren't TripleStores implemented as native graph stores, as Property Graph stores are?

SPARQL-based stores, or put another way, TripleStores, are known to be less efficient than property graph stores, on top of not being able to be distributed while maintaining performance the way property graphs can.
I understand that there is a lot at stake here, such as inferencing and so on. Putting distribution and inferencing aside, and limiting ourselves to RDFS, which can be fully captured via SPARQL, I am wondering why that is.
More specifically, why is storage the issue? What prevents a SPARQL-based store from storing data the way a property graph store does and performing traversals instead of massive join queries? Couldn't SPARQL simply be translated to Gremlin steps, for instance? What is the limitation there? Can't the joins be avoided?
My assumption is that if SPARQL can be translated into efficient traversal steps, and the data is stored the way a property graph store such as JanusGraph stores it (https://docs.janusgraph.org/latest/data-model.html), then the performance gap would be bridged while maintaining some inference such as RDFS.
That said, SPARQL is not Turing-complete of course, but at least for what it does, it would do it fast and possibly at scale as well. The goal in my view is not to compete, but to benefit from SPARQL's ease of use while using a traversal language like Gremlin for the things that really require it, e.g. OLAP.
Is there any project in that direction? Has Apache Jena considered any of this?
I saw that Graql from Grakn seems to be taking that road for the reasons I explained above, so what's stopping the TripleStore community?
#Michael, I am happy that you stepped in, as you definitely know more than me on this :). I am on a learning journey at this point. At your request, here is one of the papers that inspired my understanding:
arxiv.org/abs/1801.02911 (SPARQL querying of Property Graphs using Gremlin Traversals)
I quote them:
"We present a comprehensive empirical evaluation of Gremlinator and
demonstrate its validity and applicability by executing SPARQL queries
on top of the leading graph stores Neo4J, Sparksee and Apache
TinkerGraph and compare the performance with the RDF stores Virtuoso,
4Store and JenaTDB. Our evaluation demonstrates the substantial
performance gain obtained by the Gremlin counterparts of the SPARQL
queries, especially for star-shaped and complex queries."
They explain, however, that the results depend somewhat on the type of query.
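To make the "SPARQL as traversal steps" idea concrete, here is a hedged, hand-written illustration (not Gremlinator's actual output) of how a basic graph pattern, whose triple patterns join on shared variables, can become a single Gremlin walk in which those joins turn into pointer hops. The "person"/"knows"/"name" schema is invented:

    import java.util.List;
    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.structure.T;
    import org.apache.tinkerpop.gremlin.structure.Vertex;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    public class SparqlAsTraversal {
        public static void main(String[] args) {
            TinkerGraph graph = TinkerGraph.open();
            Vertex alice = graph.addVertex(T.label, "person", "name", "Alice");
            Vertex bob = graph.addVertex(T.label, "person", "name", "Bob");
            alice.addEdge("knows", bob);

            // SPARQL equivalent of the traversal below:
            //   SELECT ?friendName WHERE {
            //     ?p :name  "Alice" .      # pattern 1 \
            //     ?p :knows ?f .           # pattern 2  > joined on ?p and ?f
            //     ?f :name  ?friendName .  # pattern 3 /
            //   }
            GraphTraversalSource g = graph.traversal();
            List<Object> friendNames = g.V().has("person", "name", "Alice")
                    .out("knows")
                    .values("name")
                    .toList();
            System.out.println(friendNames); // [Bob]
        }
    }

Whether a real translator emits something this direct depends on the query shape, which is consistent with the paper's caveat above.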
Or, as another answer on Stack Overflow (Comparison of Relational Databases and Graph Databases) puts it, it helps to understand the issue as sets versus paths. My understanding is that TripleStores work with sets too. That said, I am definitely not aware of all the optimization techniques implemented in TripleStores lately, and I have seen several papers explaining techniques that significantly prune set join operations.
On distribution it is more of a gut feeling. For instance, doing join operations in a distributed fashion sounds very, very expensive to me. I don't have the papers at hand and my research on the matter is not exhaustive, but from what I have read (I will have to dig into my Evernote :) to back this up), that is the fundamental problem with distribution; automated smart sharding does not seem to alleviate the issue.
#Michael, this is a very, very complex subject. I'm definitely on a journey, and that's why I am using Stack Overflow to help guide my research. You probably have an idea as to why. So feel free to provide pointers, indeed.
That said, I am not saying that there is a problem with RDF and that property graphs are better. I am saying that when it comes to graph traversal, there are ways of implementing a backend that make it fast. The data model is not the issue here; the data structure used to support the traversal is. The second thing I am saying is that the choice of query language seems to influence how the "traversal" is performed, and hence the data structure used to back the data model.
That's my understanding so far, and yes, I do understand that there are a lot of other factors at play; feel free to enumerate some of them to guide my journey.
In short, my question comes down to this: is it possible to have RDF stores backed by so-called native graph storage, and to implement SPARQL in terms of traversal steps rather than joins over sets as per its algebra? Wouldn't that make things a bit faster? This seems to be roughly the approach taken by https://github.com/graknlabs/grakn, which is primarily backed by JanusGraph for graph-like storage. Although it is not RDF, Graql is the same idea as having RDFS++ plus SPARQL. They claim to simply do it better, about which I have my reservations, but that is not the fundamental question of this thread. The bottom line is that they back knowledge representation with the information-retrieval (path traversal) and storage approach that property graphs championed. Let me be clear on this: I am not saying that native graph storage is the property of property graphs. It is just, in my mind, a storage approach optimized for storing graph structures where information retrieval involves (path) traversal: https://docs.janusgraph.org/latest/data-model.html.
First, I'd love to see the references that back up your claim that RDF-based systems are inherently less efficient than property graph ones, because frankly it's a nonsensical claim. Further, there have been distributed (and I'm assuming you mean scale-out) RDF stores, so the claim that they cannot be distributed is simply incorrect.
The Property Graph model, and Gremlin, can easily be implemented on top of an RDF-based system. This has been done at least twice to my knowledge, and in one of those implementations reasoning was supported at the Gremlin/Property Graph layer. So you don't need to be a Property Graph based system to support that model. There are a myriad of reasons why systems, RDF and Property Graph alike, make specific implementation choices, from storage to execution and beyond, and those choices are guided in part by the "native" model, the technology chosen for the implementation, and, perhaps most importantly, the use cases for the system and the problems it aims to solve.
Further, it's unclear what you recommend the authors of RDF-based systems actually do; are you suggesting scale-out is beneficial? Are you stating that your preference for the Property Graph model should be taken as gospel, such that RDF-based systems should give up and switch data models? Do you want Property Graph systems to retrofit RDFS?
Finally, to the initial question you asked, I think you have it exactly backwards; the Property Graph model is a hybrid graph model mixing elements of graph and key-value models, whereas the RDF model is a pure, i.e. native, graph model. Gremlin will be adopting the RDF model, albeit with syntactic sugar around what the RDF world calls reification, but everyone else calls edge properties. So in a world where your exemplar of the Property Graph model is abandoning said model, I'm not sure what more to tell you, other than that you should do a bit more background research.

How to mix an RDBMS with a Graph DB

I am developing a website using Django and PostgreSQL which will seemingly hold huge amounts of data, as gathered on social network sites.
I need to use an RDBMS with SQL for the tabular data, to keep SQL complexity down, and also a graph DB with Cypher for the large data with high query complexity.
Please let me know how to go about this, and whether it is feasible.
EDIT: Clarification, as asked for in the comments:
The database structure can be similar to that of a social network like Facebook. I've checked the FB Engineering page for their Open Graph. For a graph DB I can find only Neo4j with proper ACID properties, though I would prefer an open-source graph DB. I require the graph DB structure basically for summarizing huge volumes of relationship data, such as friends, updates, and daily user-related updates as individual relations. Horizontal scalability is important for future upgrades.
I intend to use PostgreSQL for the base informational data and push the relational data updates to the graph DB, the way Facebook uses both MySQL and Open Graph.
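A hedged sketch of that split, with TinkerGraph standing in for the graph side and invented table/label/property names: the PostgreSQL row stays the source of truth, and each new friendship is mirrored as an edge for the traversal-heavy queries (friends-of-friends, suggestions, and so on).

    import java.sql.*;
    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

    public class FriendshipMirror {
        public static void main(String[] args) throws SQLException {
            Connection pg = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/social", "app", "secret");
            GraphTraversalSource g = TinkerGraph.open().traversal(); // stand-in for Titan/Neo4j

            // Seed two users in the graph (normally mirrored from the users table).
            g.addV("user").property("uid", 1L).iterate();
            g.addV("user").property("uid", 2L).iterate();

            // 1) Write the authoritative relational row.
            try (PreparedStatement ps = pg.prepareStatement(
                    "INSERT INTO friendships (user_id, friend_id) VALUES (?, ?)")) {
                ps.setLong(1, 1L);
                ps.setLong(2, 2L);
                ps.executeUpdate();
            }

            // 2) Mirror it as an edge for traversal-heavy queries.
            g.V().has("user", "uid", 1L).as("a")
             .V().has("user", "uid", 2L)
             .addE("friend").from("a")
             .iterate();
        }
    }

In production the mirroring would hang off a commit hook, an outbox table, or a change-data-capture feed rather than sitting inline with the insert.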
Based on your reply to my queries, I would first suggest looking at TitanDB. I believe it fulfills many of your requirements:
It is open source.
It scales horizontally.
In addition to meeting your requirements, it has existed for quite some time and many companies are using it in production. The only thing you would have to get used to is that it uses TinkerPop traversals, not Cypher queries. Also note that I believe Titan is not ACID for most backends; this is a result of it being horizontally scalable.
If you would like a more structured (but significantly less mature) approach to graph DBs, then you can look at the stack that some colleagues and I are working on, MindmapsDB, which sits on top of Titan but uses a more SQL-like query language.
OrientDB, with its Gremlin support, is also a very good option but lacks the maturity and support of Titan.
There are many other graph vendors out there, such as DSE Graph, IBM Graph, etc., but the ones I have listed above are the open-source ones I have worked with.

Data model design guidelines with GEODE

We are soon going to start something with GEODE for reference data, and I would like some guidelines for it.
As you know, in the financial reference data world there exist complex relationships between various reference data entities like Instrument, Account, Client, etc., which might be available in a database in 3NF.
If my queries are mostly read-intensive and require joins across tables (2-5 tables), what's the best way to deal with this in an in-memory data grid?
Case 1:
Separate regions for all tables in your database, and then do a similar join using OQL as you would in the database?
Even then, you will have to design it with great care so that related entities are always co-located within the same partition.
And model 1-to-many and many-to-many relationships using an object graph?
Case 2:
If you know what your join queries look like, create a view model per join query with equi-join characteristics.
Confusion:
(1) I have one join query requiring Employee and Department, using emp.deptId = dept.deptId. [OK, fantastic: one region with such a view model exists.]
(2) I have another join query requiring Employee, Department, Salary, and Address joins, to address a different requirement.
So again I have to create a view model to address (2), which will contain Employee and Department data similar to (1). This may soon hit the memory threshold.
Changes in the database can still be propagated by event listeners, but what are the recommendations for this?
Thanks,
Dharam
I think your general question is pretty broad, and there isn't just one recommended approach covering all use cases (primarily all the analytical views/models of your data required by your application(s)).
Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.
In GemFire specifically, such tuning will involve different configuration such as, but not limited to: data management policies, eviction (Overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in Off-Heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.
For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.
Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.
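As a concrete illustration of the collocation point, here is a hedged sketch (invented region, class, and field names) of Case 1: two PARTITION regions, with Employees collocated against Departments so an OQL equi-join can be satisfied partition-locally. Note that on a real multi-node cluster such a join would typically be executed inside a function on the collocated data rather than from a client, and a custom PartitionResolver would be needed to route each employee by its deptId:

    import org.apache.geode.cache.*;
    import org.apache.geode.cache.query.QueryService;
    import org.apache.geode.cache.query.SelectResults;

    public class CollocatedJoin {
        public static class Department { public int deptId; public String name; }
        public static class Employee  { public String name; public int deptId; }

        public static void main(String[] args) throws Exception {
            Cache cache = new CacheFactory().create();

            cache.<Integer, Department>createRegionFactory(RegionShortcut.PARTITION)
                 .create("Departments");

            // Collocate Employees with Departments; a PartitionResolver mapping
            // each employee to its deptId keeps related entries in the same bucket.
            PartitionAttributesFactory<String, Employee> paf = new PartitionAttributesFactory<>();
            paf.setColocatedWith("Departments");
            cache.<String, Employee>createRegionFactory(RegionShortcut.PARTITION)
                 .setPartitionAttributes(paf.create())
                 .create("Employees");

            QueryService qs = cache.getQueryService();
            SelectResults<?> rs = (SelectResults<?>) qs.newQuery(
                    "SELECT e.name, d.name FROM /Employees e, /Departments d "
                  + "WHERE e.deptId = d.deptId").execute();
            System.out.println(rs.size());
        }
    }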
Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.
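A minimal sketch of that wiring against recent Geode APIs, with invented names: a persistent AsyncEventQueue attached to a transactional PARTITION region, with an AsyncEventListener that refreshes the analytical side.

    import java.util.List;
    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.cache.RegionShortcut;
    import org.apache.geode.cache.asyncqueue.AsyncEvent;
    import org.apache.geode.cache.asyncqueue.AsyncEventListener;
    import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

    public class AnalyticalFeed {
        public static void main(String[] args) {
            Cache cache = new CacheFactory().create();

            AsyncEventListener listener = new AsyncEventListener() {
                @Override
                public boolean processEvents(List<AsyncEvent> events) {
                    for (AsyncEvent<?, ?> e : events) {
                        // Placeholder: recompute/refresh the analytical view model
                        // (e.g. an Employee-Department join region) for this key.
                        System.out.println("refresh view for key " + e.getKey());
                    }
                    return true; // processed; safe to remove from the queue
                }
                @Override
                public void close() { }
            };

            AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                    .setPersistent(true) // durable: survives member restarts
                    .setParallel(true)   // one queue slice per partition bucket
                    .create("analyticalFeed", listener);

            cache.createRegionFactory(RegionShortcut.PARTITION)
                 .addAsyncEventQueueId(queue.getId())
                 .create("Trades");
        }
    }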
The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is quite compact compared with Java serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.
Also, you might consider how expensive it is to join the data on the fly. That is, if you are able to aggregate, transform, and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental premise of MapReduce).
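For example, a function executed on the region's hosting members sees only the local slice of the data, so the aggregation (or a join against collocated data) happens next to the data and only the small result travels back. A hedged sketch with invented names, against recent Geode APIs:

    import org.apache.geode.cache.Region;
    import org.apache.geode.cache.execute.Function;
    import org.apache.geode.cache.execute.FunctionContext;
    import org.apache.geode.cache.execute.RegionFunctionContext;
    import org.apache.geode.cache.partition.PartitionRegionHelper;

    // Invoke with: FunctionService.onRegion(employees).execute(new LocalHeadcount()).getResult()
    public class LocalHeadcount implements Function<Object> {
        @Override
        public void execute(FunctionContext<Object> context) {
            RegionFunctionContext rfc = (RegionFunctionContext) context;
            // Just the slice of the region hosted on this member, not the whole region.
            Region<?, ?> local = PartitionRegionHelper.getLocalDataForContext(rfc);
            int count = local.size(); // stand-in for a real local aggregation/join
            context.getResultSender().lastResult(count);
        }

        @Override
        public String getId() {
            return "LocalHeadcount";
        }
    }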
You should know, and I am sure you are aware, that GemFire is a key-value store; therefore, mapping a complex object graph into separate regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to load them eagerly vs. lazily is a loaded problem, especially in a distributed, replicated data store such as GemFire, where consistency and availability tradeoffs exist.
There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.
It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.
No doubt any design choice you make will involve a hybrid approach. Often the path is not clear, since it really "depends" (i.e. depends on the application and data access patterns, load, all of that).
But one thing is certain: make sure you have at least a cursory knowledge and understanding of the underlying data store and its data management capabilities, particularly as they pertain to consistency and availability, beginning with this.
Note, there is also a GemFire Slack channel as well as an Apache DEV mailing list you can use to reach out to the GemFire experts and the community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

Graph Database: TinkerPop/Blueprints vs W3C Linked data

Looking for an infrastructure for network analysis of heterogeneous networks (multiple node types (multi-mode), multiple edge types (multi-relational), and multiple descriptive features (multi-featured)), I've noticed that there are two standard stacks in the graph database world:
On one hand we have the TinkerPop/Blueprints property graph model. It is supported by Neo4j, OrientDB GraphDB, Dex, Titan, InfiniteGraph, etc.
The TinkerPop stack includes the Blueprints property graph model interface, the Gremlin graph traversal language, and the Furnace graph algorithms package.
On the other hand we have the W3C's Linked Data technology stack, which is supported by AllegroGraph, 4store, Oracle Database Semantic Technologies, OWLIM, SYSTAP BigData, etc.
Semantic data is represented using RDF/RDFS/OWL and can be queried using SPARQL. On top of that, the stack offers rules and reasoning capabilities.
Now, suppose I want to represent heterogeneous data in a graph database and analyse such data (statistics, relation discovery, structure, evolution, etc.). (I know these terms are wide and vague.) What are the relative strengths of each model for various types of network analysis tasks? Do these two models complement each other?
A couple of things: your exemplars of Linked Data stacks are all triple stores. You would start building a Linked Data application by first getting your triple store set up, but calling a database a Linked Data stack is incorrect, IMO. That's also an incomplete list of triple stores; there are also Sesame, Jena, Mulgara, and Stardog. Sesame and Jena kind of pull double duty: they're the two de facto standard Java APIs for the Semantic Web, but both provide triple stores that come bundled with the APIs. I also know that both Cray and IBM are working on triple stores, but I don't know much about either at this point. I do know that Stardog works well with the TinkerPop stack, and that it's basically a matter of dropping it in and starting to write Gremlin queries against the RDF.
I think the strengths of RDF/OWL are that you 1) get a real query language, 2) they're W3C standards, and 3) you get reasoning, if the triple store supports it, for free (more or less; you still have to write an ontology).
With RDF/OWL/SPARQL being standards, it is quite easy to pick up and move to a new triple store with a different feature set should you need to; your data is already in a common format that everyone understands, and any application logic encoded as queries is completely portable. In most cases you'd be writing against either the Sesame or Jena APIs, or working over the SPARQL protocol, so you might need to change only your config/init. I think that's a big win in the early prototyping phases.
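To illustrate that portability, here is a hedged sketch using the Jena API (the data and query are invented; the in-memory model could later be swapped for TDB or a remote SPARQL endpoint with the query string untouched):

    import org.apache.jena.query.QueryExecution;
    import org.apache.jena.query.QueryExecutionFactory;
    import org.apache.jena.query.ResultSet;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.vocabulary.VCARD;

    public class PortableQuery {
        public static void main(String[] args) {
            // In-memory model for the sketch; swapping stores changes only this setup.
            Model model = ModelFactory.createDefaultModel();
            model.createResource("http://example.org/alice")
                 .addProperty(VCARD.FN, "Alice");

            // The application logic is a standard SPARQL string, portable as-is.
            String sparql = "SELECT ?s ?name WHERE { ?s <" + VCARD.FN.getURI() + "> ?name }";
            try (QueryExecution qe = QueryExecutionFactory.create(sparql, model)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    System.out.println(rs.next());
                }
            }
        }
    }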
I also think that RDF/OWL, especially combined with reasoning and the kinds of complex SPARQL queries you can write with the new SPARQL 1.1, lends itself well to building complicated analytic applications. Also, I think the impression most people have that RDF triple stores don't scale is no longer correct; most triple stores at this point easily scale into the billions of triples and have very competitive throughput numbers as well.
So based on what I think you might be doing, I think the Semantic Web stack might be a better bet for you. I did a similar project a few years back using RDF & RDFS for the backend, fronted by a simple Pylons-based webapp, and was very happy with the results.

When to choose Cassandra over a SQL/Semantic Store solution?

I have 30-40 GB of data and 3 developer machines (Core Duo i4, 3 GB). The data is a set of graph-like structures, and I have queries that traverse the graphs. Is there a guideline that could help me decide between Cassandra and a classic solution, e.g., SQL or a semantic store? My current plan is to set up Cassandra and see how it works, but I would like to learn more before starting the installation.
I would not use Cassandra for any kind of graph-level structure. It has been about 6 months since I looked into doing something similar, so maybe Cassandra has moved on since then, but I found it was fundamentally limited by the fact that it only has row-level indexes.
For a graph-based structure (assuming a simplistic one-arc-per-row layout) you really need column indexes as well. If you want to traverse the graph, you want to be able to start from a particular node A and find all the arcs that go out from that node (assuming a directed graph); without such an index you'd have to do a row scan of the entire dataset, as there is no built-in functionality for saying "give me the rows that have A in a particular column".
To achieve this you effectively have to design a data layout for Cassandra that gives you an inverted index. This is somewhat tricky and requires you to know ahead of time the types of queries you want to answer; answering new types of queries at a later date may be very difficult or impossible if you don't design well. These slides demonstrate the idea, but I hope they make it clear that you effectively have to construct your own indexes.
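In modern CQL terms (which postdate this answer), the hand-built index amounts to maintaining one table per traversal direction, partitioned by the node you start from, so "all arcs out of A" is a single-partition read instead of a full scan. A hedged sketch with the DataStax Java driver and invented keyspace/table names (the keyspace is assumed to already exist):

    import com.datastax.oss.driver.api.core.CqlSession;

    public class EdgeTables {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().withKeyspace("graph").build()) {
                session.execute("CREATE TABLE IF NOT EXISTS edges_by_source ("
                        + "source text, target text, PRIMARY KEY (source, target))");
                // The "inverted index": the same edges, keyed by their target.
                session.execute("CREATE TABLE IF NOT EXISTS edges_by_target ("
                        + "target text, source text, PRIMARY KEY (target, source))");

                // Every edge is written twice, once per lookup direction.
                session.execute("INSERT INTO edges_by_source (source, target) VALUES ('A', 'B')");
                session.execute("INSERT INTO edges_by_target (target, source) VALUES ('B', 'A')");

                // One traversal step: all arcs leaving A, served from one partition.
                session.execute("SELECT target FROM edges_by_source WHERE source = 'A'")
                       .forEach(row -> System.out.println(row.getString("target")));
            }
        }
    }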
For graph structures that can be decomposed into triples, consider an RDF store; for more complex structures, consider a full-blown graph database. If you really want to do NoSQL, you can probably build something on top of a document database, as they tend to have much better indexing, but again you'll have to think carefully about how you store your data.