Representing transactions/time in RDF - SPARQL

I need to represent electronic health records in RDF. This kind of data is time-dependent, so I want to represent it as events. I want to use something similar to a Datomic database: Datomic uses triples with an added transaction field, which is time-stamped and can carry user-defined metadata.
I want to use named graphs to record transaction/time data.
For instance, in the query below, I search only the triples of graphs that carry publisher and date metadata; a FILTER on ?publisher and ?date would narrow this to a specific editor and creation date:
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?publisher ?name ?mbox ?date
WHERE {
  ?g dc:publisher ?publisher ;
     dc:date ?date .
  GRAPH ?g
    { ?person foaf:name ?name ; foaf:mbox ?mbox }
}
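For completeness, the write side of this design could be a SPARQL Update along these lines (a minimal sketch; the ex: prefix, the transaction IRI, and the sample values are my own illustrative assumptions):

PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:   <http://example.org/>   # illustrative namespace

INSERT DATA {
  # Transaction metadata lives in the default graph, keyed by the graph IRI.
  ex:tx42 dc:publisher "Alice" ;
          dc:date "2015-06-01"^^xsd:date .
  # The facts asserted by that transaction live in its named graph.
  GRAPH ex:tx42 {
    ex:patient1 foaf:name "John Doe" ;
                foaf:mbox <mailto:john.doe@example.org> .
  }
}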
Queries like the SELECT above would solve my problem. My concerns are:
I will end up with millions of named graphs. Will they make the SPARQL queries too slow?
The triple store I am using, Blazegraph, has support for inference (entailments) but states that: "Bigdata does not support inference in the quads mode out of the box." Which triple stores do support inference using quads (named graphs)?
Is there a better way to represent this kind of data in RDF? Some kind of best practices guideline?

I will end up with millions of named graphs. Will they make the SPARQL queries too slow?
Generally speaking, not necessarily, at least not any more than adding millions of triples to one named graph. But it really depends on your triplestore, and on how good it is at indexing on named graphs.
The triple store I am using, Blazegraph, has support for inference (entailments) but states that: "Bigdata does not support inference in the quads mode out of the box." Which triple stores do support inference using quads (named graphs)?
StackOverflow is not really the right platform to ask for tool recommendations - I suggest you google around a bit instead to see feature lists of the various available triplestores.
I also suspect that at the scale you need, inferencing performance might disappoint you (again, depending on the implementation, of course). Are you sure you need inferencing? I'm not saying you definitely shouldn't use it, but depending on the expressivity of the inference you need, there are quite often ways around it by being a bit creative in your querying.
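For example, if all you need is RDFS subclass reasoning, a SPARQL 1.1 property path can often stand in for the reasoner, provided the ontology triples are loaded alongside the data (a minimal sketch; the ex: names are illustrative):

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/>   # illustrative namespace

# Matches direct and indirect instances of ex:Event without a reasoner:
# the path walks rdf:type and then zero or more rdfs:subClassOf links.
SELECT ?x WHERE {
  ?x rdf:type/rdfs:subClassOf* ex:Event .
}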
Is there a better way to represent this kind of data in RDF? Some kind of best practices guideline?
It looks like a sensible approach to me. Whether another way is better is hard to judge without knowing more about how you intend to use this data, the scale (in number of triples), etc. As for best practices: the W3C note on n-ary relations in RDF is a good resource. See also: How can I express additional information (time, probability) about a relation in RDF?
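For illustration, the n-ary pattern from that note, applied to health records, could look like this (a hedged sketch; all ex: terms are made up for the example, not taken from the note):

PREFIX ex:  <http://example.org/>   # illustrative namespace
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

INSERT DATA {
  # Instead of asserting ex:patient1 ex:hasDiagnosis ex:Hypertension directly,
  # promote the relation to an event node that can carry time metadata.
  ex:diagnosis42 a ex:DiagnosisEvent ;
      ex:patient    ex:patient1 ;
      ex:condition  ex:Hypertension ;
      ex:recordedAt "2015-06-01T12:00:00Z"^^xsd:dateTime .
}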

Related

Does an ontology have any impact on the MarkLogic triple index?

In MarkLogic 9, with a very large set of triples loaded, some SPARQL queries were very slow even though the triple index option was enabled. At that time, billions of data triples were loaded, but no ontology triples at all. After loading the ontologies, performance improved a lot.
I don't believe that's because of the ontologies themselves, because my queries do not refer to them at all. But it seems that the triple index only became effective after the ontologies were loaded. This is the first time I have encountered such a situation; usually, data triples can be queried effectively without any ontology.
Any clue why?
This is just a coincidence. There must be some other explanation for the slower/faster queries.

Why aren't triplestores implemented as native graph stores the way property-graph stores are?

SPARQL-based stores, or to put it another way, triplestores, are reputed to be less efficient than property graph stores, on top of not being able to be distributed while maintaining performance the way property graphs can.
I understand that there are a lot of things at stake here, such as inferencing and whatnot. Putting distribution and inferencing aside, and limiting ourselves to RDFS, which can be fully captured via SPARQL, I am wondering why that is.
More specifically, why is the storage the issue? What prevents a SPARQL-based store from storing its data the way a property graph store does, and from performing traversals instead of massive join queries? Can't SPARQL simply be translated into Gremlin steps, for instance? What is the limitation there? Can't the joins be avoided?
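To make my join point concrete, consider a minimal example (the prefix and the Gremlin rendering in the comment are just my illustration):

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Two triple patterns sharing ?friend: a triplestore typically evaluates
# this as a join on ?friend over its indexes, whereas a traversal engine
# would walk out from bound vertices, e.g. something like
# g.V().out('knows').out('based_near') in Gremlin.
SELECT ?city WHERE {
  ?person foaf:knows ?friend .
  ?friend foaf:based_near ?city .
}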
My assumption is that if SPARQL can be translated into efficient traversal steps, and the data is stored the way a property graph store stores it, as JanusGraph does (https://docs.janusgraph.org/latest/data-model.html), then the performance gap would be bridged while some inference, such as RDFS, is maintained.
That being said, SPARQL is of course not Turing-complete, but at least for what it does, it would do it fast, and possibly at scale as well. The goal in my view is not to compete, but to benefit from SPARQL's ease of use while using a traversal language like Gremlin for the things that really require it, e.g. OLAP.
Is there any project in that direction? Has Apache Jena considered any of this?
I saw that Graql of Grakn seems to be taking that road for the reasons I explained above, so what's stopping the triplestore community?
@Michael, I am happy that you stepped in, as you definitely know more than me on this :) . I am on a learning journey at this point. At your request, here is one of the papers that inspired my understanding:
arxiv.org/abs/1801.02911 (SPARQL querying of Property Graphs using Gremlin Traversals)
I quote them:
"We present a comprehensive empirical evaluation of Gremlinator and demonstrate its validity and applicability by executing SPARQL queries on top of the leading graph stores Neo4J, Sparksee and Apache TinkerGraph and compare the performance with the RDF stores Virtuoso, 4Store and JenaTDB. Our evaluation demonstrates the substantial performance gain obtained by the Gremlin counterparts of the SPARQL queries, especially for star-shaped and complex queries."
They explain, however, that the results depend somewhat on the type of query.
Or, as another Stack Overflow answer put it, Comparison of Relational Databases and Graph Databases also helps in understanding the difference between sets and paths. My understanding is that triplestores work with sets too. That being said, I am definitely not aware of all the optimization techniques implemented in triplestores lately, and I have seen several papers describing techniques that significantly prune set join operations.
On distribution, it is more of a gut feeling. For instance, performing join operations in a distributed fashion sounds very, very expensive to me. I don't have the papers at hand and my research on the matter is not exhaustive, but from what I have read (I will have to dig into my Evernote :) to back this up), that is the fundamental problem with distribution. Automated smart sharding does not seem to alleviate the issue here.
@Michael, this is a very, very complex subject. I'm definitely on that journey, and that's why I am using Stack Overflow to guide my research. You probably have an idea as to why. So feel free to provide pointers indeed.
That being said, I am not saying that there is a problem with RDF and that property graphs are better. I am saying that when it comes to graph traversal, there are ways of implementing a backend that make it fast. The data model is not the issue here; the data structure used to support the traversal is. The second thing I am saying is that the choice of query language seems to influence how the "traversal" is performed, and hence the data structure used to back the data model.
That's my understanding so far, and yes, I do understand that there are a lot of other factors at play, so feel free to enumerate some of them to guide my journey.
In short, my question comes down to this: is it possible to have RDF stores backed by so-called native graph storage, and to implement SPARQL in terms of traversal steps rather than joins over sets as per its algebra? Wouldn't that make things a bit faster? This seems to be somewhat the approach taken by https://github.com/graknlabs/grakn, which is primarily backed by JanusGraph for graph-like storage. Although it is not RDF, Graql is the same idea as having RDFS++ plus SPARQL. They claim to simply do it better, about which I have my reservations, but that is not the fundamental question of this thread. The bottom line is that they back knowledge representation with the information retrieval (path traversal) and storage approach that property graphs championed. Let me be clear on this: I am not saying that native graph storage is the property of property graphs. It is, in my mind, just a storage approach optimized for graph structures, where information retrieval involves (path) traversal: https://docs.janusgraph.org/latest/data-model.html.
First, I'd love to see the references that back up your claim that RDF-based systems are inherently less efficient than property graph ones, because frankly it's a nonsensical claim. Further, there have been distributed (I'm assuming you mean scale-out) RDF stores, so the claim that they cannot be distributed is simply incorrect.
The Property Graph model, and Gremlin, can easily be implemented on top of an RDF-based system. This has been done at least twice to my knowledge, and in one of those implementations reasoning was supported at the Gremlin/Property Graph layer. So you don't need to be a Property Graph based system to support that model. There are a myriad of reasons why systems, RDF and Property Graph alike, make specific implementation choices, from storage to execution and beyond, and those choices are guided in part by the "native" model, by the technology chosen for the implementation, and, perhaps most importantly, by the use cases for the system and the problems it aims to solve.
Further, it's unclear what you recommend the authors of RDF-based systems actually do. Are you suggesting scale-out is beneficial? Are you stating that your preference for the Property Graph model should be taken as gospel, such that RDF-based systems should give up and switch data models? Do you want Property Graph systems to retrofit RDFS?
Finally, to the initial question you asked, I think you have it exactly backwards: the Property Graph model is a hybrid graph model mixing elements of graph and key-value models, whereas the RDF model is a pure, i.e. native, graph model. Gremlin will be adopting the RDF model, albeit with syntactic sugar around what the RDF world calls reification and everyone else calls edge properties. So in a world where your exemplar of the Property Graph model is abandoning said model, I'm not sure what more to tell you, other than that you should do a bit more background research.

Members of a child class are not classified as members of the superclass in the LUBM benchmark

I am trying to run the LUBM benchmark, but I am having some trouble with classification after reasoning.
The files I am using are:
The main ontology
The output of the LUBM generator 1.7
The problem is that members of GraduateStudent and UndergraduateStudent are not being classified as members of the superclass Student.
I tried the Pellet, HermiT, and FaCT++ reasoners in Protégé 5.0, and all of them failed. Consequently, benchmark SPARQL query number 10 also failed.
#-- Query10
#-- This query differs from Query 6, 7, 8 and 9 in that it only requires the
#-- (implicit) subClassOf relationship between GraduateStudent and Student, i.e.,
#-- the subClassOf relationship between UndergraduateStudent and Student does not
#-- add to the results.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>
SELECT ?X WHERE {
  ?X rdf:type ub:Student .
  ?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>
}
You can find a screenshot of my Protégé classification here (sorry, I don't have enough reputation to post the picture directly).
In Protégé 4.3, classification works with Pellet and HermiT, but the SPARQL query still fails.
I have already set the reasoner to show all inferred knowledge, so it is not that the results are merely hidden.
I find this behavior very confusing, especially considering this is supposed to be a proven benchmark. I guess there is a very trivial solution, but I cannot find it, so any help would be much appreciated!
EDIT: I managed to run the benchmark. I manually copied the XML code of the ABox (the one I got from the generator) into the TBox. This way, classification works in Protégé 4 and through the API. The SPARQL queries also work using Snap SPARQL, as suggested here. Classification is still NOT working in Protégé 5. I am curious to know what was causing this.
The generator alone only produces data, which is sufficient to answer a subset of the queries (Queries 1-3 and 14). In order for a SPARQL system to answer all the queries, it needs to apply inference. How it does this is an implementation-specific detail. Also, for many systems inference is off by default and must be enabled.
Depending on the system being used, you will likely need to provide it with the main ontology that you linked, and enable any appropriate settings that are needed.
It may be that SPARQL queries in Protégé do not take inferred knowledge into account, but I have never used Protégé so I can't comment on that specific tool.
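If your SPARQL engine supports SPARQL 1.1 property paths, another workaround (a sketch, assuming the univ-bench ontology triples are loaded alongside the generated data) is to fold the subclass step into the query itself instead of relying on a reasoner:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ub:   <http://www.lehigh.edu/~zhp2/2004/0401/univ-bench.owl#>

# rdf:type followed by zero or more rdfs:subClassOf steps captures the
# implicit GraduateStudent -> Student membership without inference.
SELECT ?X WHERE {
  ?X rdf:type/rdfs:subClassOf* ub:Student .
  ?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>
}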

How can I increase the performance of a SPARQL query while using inferencing?

I want to increase the performance of my SPARQL queries, and I have to run all types of SPARQL query.
I have a total of 17,500,000 triples in the data graph, and I have another graph containing only knowledge. This graph contains sameAs and subClassOf properties, and its total number of triples is around 50,000,000. I am using on-the-fly inferencing in the SPARQL query.
I am using Virtuoso as the database. It has inferencing functionality.
When I run a query with inferencing, it takes 80 seconds even for a simple query; without inferencing, it takes 10 seconds.
SPARQL query:
DEFINE input:inference 'myrule'
SELECT DISTINCT ?uri1 ?uri2
FROM <GRAPH_NAME>
WHERE {
  ?uri1 rdf:type ezdi:Aspirin .
  ?patient ezdi:is_treated_with ?uri1 .
  ?patient rdf:type ezdi:Patient .
  ?uri2 rdf:type ezdi:Hypertension .
  ?patient ezdi:is_suffering_with ?uri2 .
  ?patient rdf:type ezdi:Patient .
}
ORDER BY ?patient
I have set up all the indexing provided by Virtuoso. The system has 32 GB of RAM.
And I have tuned the NumberOfBuffers setting in the virtuoso.ini file.
I don't know what the issue with inferencing is. But I have to use inferencing in the SPARQL query.
If you know something, then please share your ideas.
Thank you
An ontology of 5M triples is quite large, though strictly speaking, that's not problematic. Reasoning performance is far more closely tied to the expressivity of your ontology than to its size. You could create an ontology with several orders of magnitude fewer triples that would be harder to reason with.
With that said, there's not much I can specifically suggest. Virtuoso specific tuning is best left to their developers, so you might get some traction on their mailing list.
It appears that you're using some custom inferencing rule set, 'myrule' -- though in the comments you also claim RDFS & sameAs. You probably need to figure out what reasoning you're actually using, which profile (RDFS, or OWL 2 QL, RL, EL, DL) your ontology falls into, and learn a little bit about how reasoning actually works. Further, equality reasoning is difficult, and you claim to be using it in addition to RDFS. It might be possible for Virtuoso to compute the equivalence relations eagerly, which could reduce the overhead of the query, but again, that is something you should take up with them on their mailing list.
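As one hedged illustration of what eager computation could look like: subclass entailments can be materialized once with a standard SPARQL 1.1 Update, so that later queries can skip DEFINE input:inference. The graph names below stand in for yours, this covers only subClassOf (not sameAs), and whether this is the best route in Virtuoso specifically is a question for their mailing list:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Copy inferred rdf:type triples into the data graph ahead of time,
# walking one or more subClassOf steps in the knowledge graph.
INSERT { GRAPH <GRAPH_NAME> { ?x rdf:type ?super } }
WHERE {
  GRAPH <GRAPH_NAME>     { ?x rdf:type ?sub }
  GRAPH <KNOWLEDGE_GRAPH> { ?sub rdfs:subClassOf+ ?super }
}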
Reasoning is not easy by any means, and there is no silver bullet for magically making reasoning faster beyond using a simpler (i.e. less expressive) ontology, or less data, or both.
Lastly, you might try other databases which are designed for reasoning, such as OWLIM or Stardog. Not all databases are created equal, and it's entirely possible you've encoded something in your TBox which Virtuoso might not handle well, but could be handled easily by another system.
There are many factors which could lead to the performance issue you describe. The most common is an error in the NumberOfBuffers setting in the INI file -- which we cannot see, and so cannot diagnose, here.
Questions specifically regarding Virtuoso are generally best raised on the public OpenLink Discussion Forums, the Virtuoso Users mailing list, or through a confidential Support Case. If you bring this there, we should be able to help you in more detail.

Graph Database: TinkerPop/Blueprints vs W3C Linked Data

Looking for an infrastructure for network analysis of heterogeneous networks (multiple node types (multi-mode), multiple edge types (multi-relational), and multiple descriptive features (multi-featured)), I've noticed that there are two standard stacks in the graph database world:
On one hand we have the TinkerPop/Blueprints property graph model. It is supported by Neo4j, OrientDB GraphDB, Dex, Titan, InfiniteGraph, etc.
The TinkerPop stack includes the Blueprints property graph model interface, the Gremlin graph traversal language, and the Furnace graph algorithms package.
On the other hand we have the W3C's Linked Data technology stack, which is supported by AllegroGraph, 4store, Oracle Database Semantic Technologies, OWLIM, SYSTAP Bigdata, etc.
Semantic data is represented using RDF/RDFS/OWL and can be queried using SPARQL. On top of that, the stack offers rules and reasoning capabilities.
Now, suppose that I want to represent heterogeneous data in a graph database and analyse such data (statistics, relation discovery, structure, evolution, etc.) (I know these terms are broad and vague) - what are the relative strengths of each model for various types of network analysis tasks? Do these two models complement each other?
A couple of things: your exemplars of linked data stacks are all triple stores. You would start building a linked data application by first getting your triple store set up, but calling a database a linked data stack is incorrect, imo. That's also an incomplete list of triple stores; there are also Sesame, Jena, Mulgara, and Stardog. Sesame and Jena kind of pull double duty: they're the two de facto standard Java APIs for the semantic web, but both provide triple stores that come bundled with the APIs. I also know that both Cray and IBM are working on triple stores, but I don't know much about either at this point. I do know that Stardog works well with the TinkerPop stack, and that it's basically drop in and start writing Gremlin queries against the RDF.
I think the strengths of RDF/OWL are that you 1) get a real query language, 2) they're W3C standards, and 3) you get reasoning, if the triple store supports it, for free (more or less -- you still have to write an ontology).
With RDF/OWL/SPARQL being standards, it is quite easy to pick up and move to a new triple store with a different feature set should you need to: your data is already in a common format that everyone understands, and any application logic encoded as queries is completely portable. In most cases, you'd be writing against either the Sesame or Jena APIs, or working over the SPARQL protocol, so you might need to change only your config/init. I think that's a big win in the early prototyping phases.
I also think that RDF/OWL, especially combined with reasoning and the kinds of complex SPARQL queries that you can create with the new SPARQL 1.1, lends itself well to building complicated analytic applications. Also, I think the impression most people have that RDF triple stores don't scale is no longer correct. Most triple stores at this point easily scale into the billions of triples and have very competitive throughput numbers as well.
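To give a flavour of the SPARQL 1.1 features meant here, aggregation alone already covers a lot of basic network statistics (a sketch; the pattern is generic rather than tied to any particular dataset):

# Out-degree per node across all predicates, a typical first
# network-analysis statistic, using SPARQL 1.1 aggregates.
SELECT ?node (COUNT(*) AS ?outDegree)
WHERE { ?node ?p ?o }
GROUP BY ?node
ORDER BY DESC(?outDegree)
LIMIT 10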
So based on what I think you might be doing, I think the semantic web stack might be a better bet for you. I did a similar project a few years back using RDF & RDFS for the backend, fronted by a simple Pylons-based webapp, and was very happy with the results.