In MarkLogic 9, with a very large set of triples loaded, some SPARQL queries were very slow even though the triple index option was enabled. At that point, billions of data triples had been loaded, but no ontology triples at all. After loading the ontologies, performance improved dramatically.
I don't believe the ontologies themselves are the cause, because my queries do not refer to them at all. But it seems that the triple index only became effective after the ontologies were loaded. This is the first time I have encountered such a situation; usually, data triples can be queried efficiently without any ontology.
Any clue why?
This is just a coincidence. There must be some other explanation for the slower/faster queries.
To explain the problem I am facing, let me take an example from the Ruby Sequel documentation.
In the case of
Album.eager(:artists, :genre).all
This is fine if the dataset is comparatively small.
If the data is huge, this query will collect artist_ids by the thousands; if the data runs into the millions, that would be a million artist_ids being collected. Is there a way to fetch the same output with a different, optimized query?
Just don't eager load large datasets, maybe? :)
(TL;DR: this is not an answer but rather a wordy way to say "there is no single simple answer" :))
Simple solutions for simple problems. The default (naive) eager loading strategy efficiently solves the N+1 problem for reasonably small relations, but if you try to eager load huge relations you might (and will) shoot yourself in the foot.
For example, if you fetch just 1000 albums with ids, say, (1..1000), then your ORM will fire additional eager-loading queries for artists and genres that look like select * from <whatever> where album_id in (1, 2, 3, ..., 1000), with 1000 ids in the IN list. And this is already a problem on its own: the performance of where ... in queries can be suboptimal even in modern DBs with query planners as smart as Einstein. At a certain data scale this becomes awfully slow even with such a small batch. And if you try to eager load just everything (as in your example), it's simply not feasible for almost any real-world dataset (except the smallest use cases).
So,
In general, it is better to avoid loading "all" and to load data in batches instead (see the sketch after this list);
EXPLAIN is your best friend - always analyze the queries that your ORM fires. Even production-grade, battle-tested ORMs can (and will) produce sub-optimal queries from time to time;
The latter is especially true for large datasets - at a certain scale you will have no other choice but to move from the nice ORM API to lower-level, custom-tailored SQL queries (at least for the bottlenecks);
At a certain scale even custom SQL will not help any more - problems at that scale need to be addressed on another level (data remodeling, caching, sharding, CQRS, etc.)
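To illustrate the batching point, here is a minimal Sequel sketch (the model names come from the question; process is a hypothetical placeholder for whatever you do with each record). Keyset pagination keeps every eager-load IN list bounded at the batch size:

BATCH_SIZE = 1000
last_id = 0
loop do
  # Keyset pagination: each page is bounded, so the eager-load
  # queries for :artists and :genre carry at most BATCH_SIZE ids.
  batch = Album.where { id > last_id }
               .order(:id)
               .limit(BATCH_SIZE)
               .eager(:artists, :genre)
               .all
  break if batch.empty?
  batch.each { |album| process(album) }  # process is a placeholder
  last_id = batch.last.id
end

Ordering by the primary key and filtering with id > last_id (rather than OFFSET) keeps each page cheap even deep into the table.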
I need to represent electronic health records in RDF. This kind of data is time-dependent, so I want to represent the records as events. I want to use something similar to a Datomic database: Datomic uses triples with an added transaction field, which is time-stamped and can carry user-defined metadata.
I want to use named graphs to record transaction/time data.
For instance, in the query below, I only search triples in graphs from a certain publisher, created on a certain date:
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?mbox ?date
WHERE {
  ?g dc:publisher ?name ;
     dc:date ?date .
  GRAPH ?g
    { ?person foaf:name ?name ; foaf:mbox ?mbox }
}
Queries like this one would solve my problem. My concerns are:
I will end up with millions of named graphs. Will they make the SPARQL queries too slow?
The triple store I am using, Blazegraph, has support for inference (entailments) but states that: "Bigdata does not support inference in the quads mode out of the box." Which triple stores do support inference using quads (named graphs)?
Is there a better way to represent this kind of data in RDF? Some kind of best practices guideline?
I will end up with millions of named graphs. Will they make the SPARQL queries too slow?
Generally speaking, not necessarily - at least no more than adding millions of triples to a single named graph would. But it really depends on your triplestore and on how good it is at indexing named graphs.
The triple store I am using, Blazegraph, has support for inference (entailments) but states that: "Bigdata does not support inference in the quads mode out of the box." Which triple stores do support inference using quads (named graphs)?
StackOverflow is not really the right platform to ask for tool recommendations - I suggest you google around a bit instead to see feature lists of the various available triplestores.
I also suspect that at the scale you need, inferencing performance might disappoint you (again, depending on the implementation of course). Are you sure you need inferencing? Not saying you definitely shouldn't, but depending on the expressivity of the inference you need, there are quite often ways around by being a bit creative in terms of querying.
Is there a better way to represent this kind of data in RDF? Some kind of best practices guideline?
It looks like a sensible approach to me. Whether another way is better is hard to judge without knowing more about how you intend to use this data, the scale (in number of triples), etc. As for best practices: the W3C note on N-ary relations in RDF is a good resource. Also see: How can I express additional information (time, probability) about a relation in RDF?
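To make the pattern concrete, a named-graph-per-transaction record could be serialized in TriG roughly as follows (a hedged sketch - all IRIs, names and dates are hypothetical):

@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:   <http://example.org/ehr/> .

# One named graph per transaction; its triples are the event payload.
ex:tx001 {
    ex:patient42 foaf:name "Alice Example" ;
                 foaf:mbox <mailto:alice@example.org> .
}

# Provenance about the transaction graph, stored in the default graph.
ex:tx001 dc:publisher "Dr. B. Hypothetical" ;
         dc:date "2015-06-01"^^xsd:date .

The provenance triples about ex:tx001 live in the default graph, which is exactly what the SPARQL query in the question joins against.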
I am in a situation where I need to test several "select *"-style SPARQL queries against a few TB of triple data - not in production, of course. However, we only have limited machine resources (4 GB of memory) to test the queries. I understand that this calls for more memory, but are there any alternatives for running the queries and getting results? Long runtimes are acceptable.
My laptop has 32GB of RAM, so that sounds under-resourced even for a dev server. Having said that, for any particular query I would look for ways to reduce the number of triples you're running against. Are your triples segmented into graphs, and if so, can a query be directed against one graph? Another reducing strategy is to use the $query parameter to sem:sparql to identify documents that hold the triples you care about.
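For instance, directing a query at a single named graph keeps the store from touching the full multi-TB triple set (a minimal sketch - the graph IRI is hypothetical):

# Hedged sketch: the graph IRI is made up; scoping the query
# to one named graph bounds the number of triples scanned.
SELECT *
FROM <http://example.org/graphs/2015-subset>
WHERE { ?s ?p ?o . }
LIMIT 100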
I want to know the computational complexity of working with ontologies. I am using the NIFSTD ontology and want to build a hierarchy based on the computation time (big O) of some specific queries.
I have read that SPARQL itself is PSPACE-complete. As you know, ontologies are usually based on RDF and are queried with SPARQL.
I want to know the cost of searching for a concept in the ontology (SELECT) and of searching with a condition (WHERE { ... }).
In addition, is the computational complexity of opening and reading an ontology O(n), where n is the size of the ontology file?
Thanks in Advance,
Aref
Searching for a concept where the IRI is known is very fast - most APIs have indexes for these, so you can assume O(1) for this.
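For instance, a lookup by a known IRI is a single index probe in most stores (a hedged sketch - the concept IRI below is hypothetical):

# Hedged sketch: a lookup by a known subject IRI is answered
# directly from the index, in effectively constant time.
SELECT ?property ?value
WHERE { <http://example.org/nifstd/SomeConcept> ?property ?value }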
Searching with a where condition depends entirely on the condition. The more complex the condition, the higher the cost - the upper limit is the same as the worst case complexity you mentioned.
Another factor to be taken into account is whether you want to include any inference regime in your analysis or not; if so, then there's another source of unknown complexity there - OWL 2 DL inferencing has exponential worst case complexity, but what the actual difficulty of reasoning on a given ontology is depends on many factors, and it is not easy to predict. Still an open research question.
The cost of loading an ontology is roughly proportional to its size, although some axioms require more processing than others. Big variations between different APIs are to be expected; I would consider O(n) an acceptable approximation.
I want to increase the performance of my SPARQL queries, and I have to run all types of SPARQL queries.
I have a total of 17,500,000 triples in the data graph, and I have another graph containing only knowledge: it holds sameAs and subClassOf properties and contains around 50,000,000 triples in total. I am using on-the-fly inferencing in the SPARQL query.
I am using Virtuoso as the database. It has inferencing functionality.
When I run a query with inferencing, it takes 80 seconds even for a simple query; without inferencing, the same query takes 10 seconds.
SPARQL query:
DEFINE input:inference 'myrule'
SELECT DISTINCT ?uri1 ?uri2
FROM <GRAPH_NAME>
WHERE {
  ?uri1 rdf:type ezdi:Aspirin .
  ?patient ezdi:is_treated_with ?uri1 .
  ?patient rdf:type ezdi:Patient .
  ?uri2 rdf:type ezdi:Hypertension .
  ?patient ezdi:is_suffering_with ?uri2 .
  ?patient rdf:type ezdi:Patient .
}
ORDER BY ?patient
I have done all the indexing provided by Virtuoso. The system has 32 GB of RAM, and I have set NumberOfBuffers in the virtuoso.ini file.
I don't know what the issue with inferencing is, but I have to use inferencing in the SPARQL query.
If you know something, please share your ideas.
Thank you
An ontology of 5M triples is quite large, though strictly speaking, that's not problematic. Reasoning performance is far more closely tied to the expressivity of your ontology than to its size. You could create an ontology with several orders of magnitude fewer triples that would be much harder to reason with.
With that said, there's not much I can specifically suggest. Virtuoso specific tuning is best left to their developers, so you might get some traction on their mailing list.
It appears that you're using some custom inference rule set, 'myrule' -- though in the comments you also claim RDFS & sameAs. You probably need to figure out what reasoning you're actually using, which profile (RDFS, or OWL 2 QL, RL, EL, DL) your ontology falls into, and learn a little about how reasoning actually works. Further, equality (sameAs) reasoning, which you claim to be using in addition to RDFS, is difficult. It might be possible for Virtuoso to compute the equivalence relations eagerly, which could reduce the overhead of the query, but again, that is something you should take up with them on their mailing list.
Reasoning is not easy by any means, and there's no silver bullet for magically making reasoning faster beyond using a simpler, ie less expressive, ontology or less data, or both.
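One workaround worth measuring is to emulate the subClassOf part of the reasoning with SPARQL 1.1 property paths instead of the inference engine. A hedged sketch follows: it assumes the rdf, rdfs and ezdi prefixes are declared as in your setup, and <KNOWLEDGE_GRAPH> is a hypothetical name for the graph holding your subClassOf triples; note it covers subclass entailment only - sameAs would still need the engine or eager materialization.

SELECT DISTINCT ?uri1 ?uri2
FROM <GRAPH_NAME>
FROM <KNOWLEDGE_GRAPH>
WHERE {
  # rdf:type/rdfs:subClassOf* matches direct types and all supertypes,
  # approximating RDFS subclass inference at query time.
  ?uri1 rdf:type/rdfs:subClassOf* ezdi:Aspirin .
  ?patient ezdi:is_treated_with ?uri1 .
  ?uri2 rdf:type/rdfs:subClassOf* ezdi:Hypertension .
  ?patient ezdi:is_suffering_with ?uri2 .
}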
Lastly, you might try other databases which are designed for reasoning, such as OWLIM or Stardog. Not all databases are created equal, and it's entirely possible you've encoded something in your TBox which Virtuoso might not handle well, but could be handled easily by another system.
There are many factors which could lead to the performance issue you describe. The most common is to make an error in the NumberOfBuffers setting in the INI file -- which we cannot see, and so cannot diagnose, here.
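For reference, the Virtuoso documentation scales the buffer settings to available RAM; for a 32 GB machine the commonly cited values are roughly the following (verify against the documentation for your Virtuoso version):

[Parameters]
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000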
Questions specifically regarding Virtuoso are generally best raised on the public OpenLink Discussion Forums, the Virtuoso Users mailing list, or through a confidential Support Case. If you bring this there, we should be able to help you in more detail.