Running a SPARQL query against a WikiData dump

I have a series of simple but exhaustive SPARQL queries. Running them against the public WikiData SPARQL endpoint results in timeouts, and setting up a local WikiData instance would be a serious investment that is not worth it this time. So I started with a simple solution:
1. I use the WikiData SPARQL endpoint to explore the data, tune the query, and evaluate its results, using LIMIT 100 to avoid timeouts (see the sketch after this list).
2. Once the query is tuned, I manually translate it into a series of JSON-path queries, Python filters, etc., to run over my local dump of WikiData.
3. I run them locally. Processing the whole dump sequentially takes time, but it works.
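Step 1 is easy to script. A minimal sketch using SPARQLWrapper, where the example pattern (P31 = Q5, instance-of human) and the User-Agent string are my own placeholders:

from SPARQLWrapper import SPARQLWrapper, JSON

# Public endpoint; WDQS asks clients to send a descriptive User-Agent.
endpoint = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="wikidata-dump-experiments/0.1 (contact@example.org)",
)

# Illustrative query in the spirit of the question: entities selected
# by a property/value pair, capped at 100 to stay under the timeout.
endpoint.setQuery("""
    SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 100
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["item"]["value"])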
The second step is error-prone and time-consuming. Is there an automated solution that can execute SPARQL queries (or rather a subset of SPARQL) over a local dump without setting up a database?
My SPARQL queries are pretty simple: they extract entities based on their properties and values. I do not build large graphs, and I do not use any transitive properties.
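For step 2, since the queries only match entities by property/value pairs, the manual translation can be replaced by a small reusable filter run over the dump as a stream. A sketch, assuming the usual dump layout (a JSON array with one entity per line); the path and the P31 = Q5 example are placeholders:

import bz2
import json

DUMP_PATH = "latest-all.json.bz2"  # placeholder path to a local dump

def entities(path):
    # Stream entities without loading the dump into memory. Each line
    # holds one entity's JSON, followed by a comma except on the last.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def has_claim(entity, prop, qid):
    # True if the entity has a statement `prop` whose value is item
    # `qid`: the dump-side equivalent of the pattern ?item wdt:prop wd:qid.
    for claim in entity.get("claims", {}).get(prop, []):
        snak = claim.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue
        value = snak.get("datavalue", {}).get("value")
        if isinstance(value, dict) and value.get("id") == qid:
            return True
    return False

for e in entities(DUMP_PATH):
    if has_claim(e, "P31", "Q5"):
        print(e["id"])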

Related

Performant method to get triple statistics or counts in an Apache Jena Fuseki repository

I have a requirement to count or estimate the number of triples in a repository. At the moment I am using a SPARQL query to count them, but this is slowing down as the repository grows. Is there a better way to get the same information without executing a query? Could I get an estimate instead? My thinking is that, under the hood, the query planner might already have this information, so one possibility could be to write an extension function or custom plugin to pull the statistic from the planner.
I would really appreciate any ideas about a performant way to get the number, or an estimate, of triples in a repository. I do not need filtering; all I need is the total number of triples. Even an estimate is fine, as long as it is not too far off from the actual count.
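For reference, the counting approach described above usually amounts to the query below, issued over Fuseki's HTTP interface (the dataset URL is an assumption). It scans the whole store, which is why it degrades as the repository grows:

import requests

ENDPOINT = "http://localhost:3030/dataset/sparql"  # placeholder dataset URL

COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

resp = requests.post(
    ENDPOINT,
    data={"query": COUNT_QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"][0]["triples"]["value"])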

Parallel Write Operations in GraphDB Free vs. Standard Edition

I am trying to run several SPARQL queries in parallel in Ontotext GraphDB. These queries are identical except for the named graph they read from. I've attempted a multithreading solution in Scala to launch 3 queries against the database in parallel.
The issue is that I am using the Free Edition of GraphDB, which only supports a single core for write operations. In practice this seems to mean that the queries which are supposed to run in parallel simply queue up for the single core: in my test, the first query had completed 41,145 operations after 12 seconds, while no operations had completed on the other two. Once the first query completes, the second runs to completion, and once that completes, the third runs.
I understand this is likely expected behavior for the Free Edition. My question is: will upgrading to the Standard Edition fix this problem and allow the queries to actually run in parallel? Based on the documentation I've looked at, multiple cores can be made available to the Standard Edition for write operations. However, I also saw something implying that a single write query launched against the Standard Edition would automatically be processed over multiple cores, which might make the multithreading approach obsolete anyway?
Anyone have experience with launching parallel write operations against GraphDB and is able to weigh in?
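For illustration, the launch pattern in question looks roughly like this in Python (the endpoint URL, repository name, graph IRIs, and update body are all placeholders; GraphDB exposes the RDF4J-style REST protocol, where updates go to /repositories/<id>/statements). Under the Free Edition the three updates still serialize behind the single write core:

from concurrent.futures import ThreadPoolExecutor
import requests

UPDATE_ENDPOINT = "http://localhost:7200/repositories/myrepo/statements"

GRAPHS = [
    "http://example.org/graph1",
    "http://example.org/graph2",
    "http://example.org/graph3",
]

# Identical updates that differ only in the graph they read from.
UPDATE_TEMPLATE = """
INSERT {{ GRAPH <http://example.org/target> {{ ?s ?p ?o }} }}
WHERE  {{ GRAPH <{g}> {{ ?s ?p ?o }} }}
"""

def run_update(graph):
    resp = requests.post(
        UPDATE_ENDPOINT,
        data={"update": UPDATE_TEMPLATE.format(g=graph)},
    )
    resp.raise_for_status()
    return graph

with ThreadPoolExecutor(max_workers=3) as pool:
    for done in pool.map(run_update, GRAPHS):
        print("finished", done)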
You can find the difference on the official GraphDB 9.1 benchmark page:
http://graphdb.ontotext.com/documentation/standard/benchmark.html

Does an ontology have any impact on the MarkLogic triple index?

In MarkLogic 9, with a very large set of triples loaded, some SPARQL queries were very slow even though the triple index option was enabled. At that point, billions of data triples were loaded but no ontology triples at all. After loading the ontologies, performance improved a lot.
I don't believe the ontologies themselves are the cause, because my queries do not refer to them at all. But it seems the triple index only became effective after the ontologies were loaded. This is the first time I have encountered such a situation; usually, data triples can be queried effectively without any ontology.
Any clue why?
This is just a coincidence. There must be some other explanation for the slower/faster queries.

MarkLogic memory limit for SPARQL?

I am in a situation where I have to test several "SELECT *"-style SPARQL queries against a few TB of triple data, of course not in production. However, we only have limited machine resources (4 GB of memory) for testing the queries. I understand that this requires more memory, but are there any alternatives for running the queries and getting results? (Slow is fine.)
My laptop has 32 GB of RAM, so that sounds under-resourced even for a dev server. Having said that, for any particular query I would look for ways to reduce the number of triples you're running against. Are your triples segmented into graphs, and if so, can a query be directed against one graph? Another reduction strategy is to use the $query parameter of sem:sparql to identify the documents that hold the triples you care about.
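As a sketch of that graph-segmentation idea (the endpoint, credentials, and graph IRI are assumptions; MarkLogic also accepts SPARQL over its REST API), restricting the query to one named graph and capping the result keeps the working set small:

import requests
from requests.auth import HTTPDigestAuth

ENDPOINT = "http://localhost:8000/v1/graphs/sparql"  # placeholder server/port

# FROM narrows the query to a single named graph instead of all triples.
QUERY = """
SELECT *
FROM <http://example.org/my-graph>
WHERE { ?s ?p ?o }
LIMIT 1000
"""

resp = requests.post(
    ENDPOINT,
    data=QUERY,
    headers={
        "Content-Type": "application/sparql-query",
        "Accept": "application/sparql-results+json",
    },
    auth=HTTPDigestAuth("admin", "admin"),  # placeholder credentials
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row)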

How can I increase the performance of a SPARQL query while using inferencing?

I want to increase the performance of my SPARQL queries, and I have to support all types of SPARQL queries. I have a total of 17,500,000 triples in the data graph, plus another graph containing only knowledge: it holds sameAs and subClassOf properties and has around 50,000,000 triples in total. I use on-the-fly inferencing in the SPARQL query.
I am using Virtuoso as the database; it has inferencing functionality.
When I run a query with inferencing, it takes 80 seconds even for a simple query; without inferencing it takes 10 seconds.
SPARQL query:
DEFINE input:inference 'myrule'
SELECT DISTINCT ?uri1 ?uri2
FROM <GRAPH_NAME>
WHERE {
  ?uri1    rdf:type                ezdi:Aspirin .
  ?patient ezdi:is_treated_with    ?uri1 .
  ?patient rdf:type                ezdi:Patient .
  ?uri2    rdf:type                ezdi:Hypertension .
  ?patient ezdi:is_suffering_with  ?uri2 .
  ?patient rdf:type                ezdi:Patient .
}
ORDER BY ?patient
I have set up all the indexes Virtuoso provides, and the system has 32 GB of RAM. I have also configured the NumberOfBuffers setting in the virtuoso.ini file.
I don't know what the issue with inferencing is, but I have to use inferencing in the SPARQL query. If you know anything, please share your ideas.
Thank you
An ontology of 50M triples is quite large, though strictly speaking that's not problematic. Reasoning performance is far more closely tied to the expressivity of your ontology than to its size; you could create an ontology several orders of magnitude smaller that would be harder to reason with.
With that said, there's not much I can specifically suggest. Virtuoso-specific tuning is best left to its developers, so you might get some traction on their mailing list.
It appears that you're using a custom inference rule set, 'myrule', though in the comments you also claim RDFS and sameAs. You probably need to figure out what reasoning you're actually using, what profile (RDFS, or OWL 2 QL, RL, EL, DL) your ontology falls into, and learn a little about how reasoning actually works. Further, equality reasoning is difficult, and you claim to be using it in addition to RDFS. It may be possible for Virtuoso to compute the equivalence relations eagerly, which could reduce the overhead of the query, but again, that is something you should take up with them on their mailing list.
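If eager computation is an option, one store-agnostic way to approximate it is to materialize the inferred triples once with a SPARQL update and then query without the DEFINE pragma. A sketch only, with placeholder endpoint, credentials, and graph names (Virtuoso's own machinery may offer a better route; this covers the subClassOf part, and owl:sameAs would need similar treatment):

import requests
from requests.auth import HTTPDigestAuth

ENDPOINT = "http://localhost:8890/sparql-auth"  # placeholder authorized endpoint

# Write the subclass closure into the data graph so that plain
# rdf:type lookups succeed without on-the-fly inference.
MATERIALIZE = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT { GRAPH <GRAPH_NAME> { ?s rdf:type ?super } }
WHERE {
  GRAPH <GRAPH_NAME>      { ?s rdf:type ?sub }
  GRAPH <KNOWLEDGE_GRAPH> { ?sub rdfs:subClassOf+ ?super }
}
"""

resp = requests.post(
    ENDPOINT,
    data={"query": MATERIALIZE},
    auth=HTTPDigestAuth("dba", "dba"),  # placeholder credentials
)
resp.raise_for_status()
print(resp.text)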
Reasoning is not easy by any means, and there is no silver bullet for magically making it faster beyond using a simpler (i.e. less expressive) ontology, less data, or both.
Lastly, you might try other databases which are designed for reasoning, such as OWLIM or Stardog. Not all databases are created equal, and it's entirely possible you've encoded something in your TBox which Virtuoso might not handle well, but could be handled easily by another system.
There are many factors that could lead to the performance issue you describe. The most common is an error in the NumberOfBuffers setting in the INI file, which we cannot see, and so cannot diagnose, here.
Questions specifically regarding Virtuoso are generally best raised on the public OpenLink Discussion Forums, the Virtuoso Users mailing list, or through a confidential Support Case. If you bring this there, we should be able to help you in more detail.