Spatial SPARQL Queries in RDF4J with Lucene support

I am testing RDF4J on spatial queries. I have deployed the RDF4J Server and Workbench apps on Apache Tomcat 9.0.12. My current dataset has 853 LineStrings and 88 Polygons represented as WKT literals (geo:asWKT). But the query performance varies with the type of repository I am working with. The following is the SPARQL query.
PREFIX cpmeta1: <http://meta.icos-cp.eu/ontologies/cpmeta/>
PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX sf: <http://www.opengis.net/ont/sf#>
PREFIX uom: <http://www.opengis.net/def/uom/OGC/1.0/>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
SELECT (count(distinct ?obj1) as ?C)
WHERE {
?obj1 a geo:Feature;
geo:hasGeometry ?geom1.
?geom1 a sf:LineString;
geo:asWKT ?coord1.
?obj2 a geo:Feature;
geo:hasGeometry ?geom2.
?geom2 a sf:Polygon;
geo:asWKT ?coord2.
FILTER(geof:sfWithin(?coord1,?coord2))
}
The query runs fine and returns 567 as the count of LineString objects that fall within any polygon.
The problem is the time it takes to return the result. If the repository is a plain Memory or Native store, the query executes in 10 to 40 seconds across iterations. However, if the repository was created with Lucene support, the execution time of the same query exceeds 30 minutes.
From the RDF4J documentation I gather that with Lucene there is supposed to be a spatial index for asWKT fields, so I expected the spatial join to run faster with Lucene. On the contrary, the same query's performance deteriorates dramatically with Lucene.
I tested the same scenario both from the Workbench app and from the API in an Eclipse project. In both cases the query performance degrades severely when Lucene is involved.
Can anyone advise me on what I am missing here?
Best Regards

Related

Wikidata Virtuoso SPARQL Endpoint - How to get more than 100,000 results

I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get for example the redirects with this query, it only returns 100,000 results at most:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
I'm writing to ask if you know of any way to get more than 100,000 results. I would like to be able to retrieve the maximum number of results possible.
Once the results are obtained, I need 3 files (or as few files as possible) in N-Triples format: wikidata_intance_types.nt, wikidata_redirecions.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly you'll be far more successful: limit the rate of your queries, and limit the size of partial result sets by using ORDER BY, LIMIT, and OFFSET to step through to a full result set for a query that overflows the instance's maximum result set size.
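For illustration, a paged version of the original CONSTRUCT could look like the following (the 50,000 page size and the OFFSET of 100,000 are example values only; repeat the query with an increasing OFFSET until fewer than 50,000 triples come back):
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
ORDER BY ?resource
LIMIT 50000
OFFSET 100000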
You can get and host your own copy of wikidata as explained in
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. with https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account.

GraphDB queries fail silently (OutOfMemoryError)

I'm dealing with a pretty huge repository (i.e., ~16M statements). I'm trying to get a list of all distinct class membership patterns in my graph. Here is the query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select distinct (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
}
group by (?s)
order by ?fo
Such a query should, of course, return results. Strangely enough, sometimes I get the result that I need, and sometimes the query returns no results.
Why is that happening and how can I avoid that?
PS:
Monitoring the query, I noticed that when I get no results, the query status stays stuck on:
IN_HAS_NEXT
0 operations
until the conclusion.
Update:
Here is the main log describing the problem. The output in the workbench of GraphDB is:
No results. Query took 1m 53s, minutes ago.
No mention of errors. Quite strange. As Gilles-Antoine Nys pointed out, it is a problem with memory and Java's GC. Overall, I think that the workbench should explicitly show an error message in such cases.
First, your query cannot tell you which resource each class-membership pattern belongs to. I recommend adding ?s to your SELECT for expressiveness, so you can more easily answer your question.
Here is the simplest query for it:
select ?s (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
} group by (?s)
(ORDER BY ?fo can be added if it is important to you.)
Secondly, IN_HAS_NEXT is just a state the query shows when you pause it in the monitor. It has nothing to do with an error.
And finally, verify your query-timeout parameter. Its default value is set to 0.
Update:
Actually you have two errors in your log file. The problem is that your JVM is taking too long to garbage-collect your memory (more than 98% of the heap is still in use).
One solution is to increase the GC overhead limit, which lets the JVM keep working on your huge dataset.
As the other comments have already suggested, the error is caused by an OutOfMemoryError (OOME). In the upcoming GraphDB 8.6 release, the developers made a much more memory-efficient implementation of all aggregates and DISTINCT. Until its public release, there are only a few options to test:
Decrease the amount of consumed memory by writing a slightly more optimal version of your query, which displays only the local names:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?s_local (group_concat(?t_local) as ?fo)
where {
?s rdf:type ?t.
BIND (REPLACE(STR(?t), "^(.*)(/|#)([^#/]*)$", "$3") as ?t_local)
BIND (REPLACE(STR(?s), "^(.*)(/|#)([^#/]*)$", "$3") as ?s_local)
} group by (?s_local)
order by (?fo)
Increase the available RAM for performing all aggregate calculations. You can increase it by either passing a higher value for the -Xmx parameter or setting a minimal value for graphdb.page.cache.size in graphdb.properties, e.g. graphdb.page.cache.size=100M. The cache controls the number of pages stored in memory, which for your query and repository size won't make much of a difference.
Limit the maximum length of the group_concat string with -Dgraphdb.engine.function.concat.max-length=4096. The limit won't make the query execute successfully, but it will indicate whether the problem is too many subjects or too-long strings.
A problem with GraphDB Free is that heap usage grows every time the Workbench is called. For example, if you run a query or just refresh the Workbench, you can see the heap size increase. So even with a large amount of memory, running successively more queries can produce an OOM, since heap usage keeps growing continuously. The solution might be to look for temporary data stored by GraphDB on every call that increases the heap size.

Filter by datatype

I'm trying to filter by datatype in DBpedia. For example:
SELECT *
WHERE {?s ?p ?o .
FILTER ( datatype(?o) = xsd:integer)
}
LIMIT 10
But I get no results, while there are certainly integer values. I get the same from other Virtuoso-based endpoints, but I do get results from endpoints running other engines. What could be the problem? If Virtuoso does not implement this SPARQL function properly, what should I use instead?
This is an expensive query, since it has to traverse all triples (?s ?p ?o .). The query's execution time exceeds the maximum time configured for the Virtuoso instance that serves DBpedia's SPARQL endpoint at http://dbpedia.org/sparql.
If you don't use the timeout parameter, you will get a timeout error (Virtuoso S1T00 Error SR171: Transaction timed out). When you use the timeout (set to 30000 by default for the DBpedia endpoint), you will get incomplete results containing HTTP headers like these:
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1.389M rnd 5.146M seq 0 same seg 55.39K same pg 195 same par 0 disk 0 spec disk 0B / 0 me
An empty result may thus simply be incomplete; it does not necessarily indicate that there are no xsd:integer literals in DBpedia. A related discussion about partial results in Virtuoso can be found here.
As a solution, you can load DBpedia from dumps and analyze it locally.
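If loading dumps is not an option, you can at least avoid the full scan by constraining the triple pattern to a specific predicate. A sketch (dbo:populationTotal is only an assumed example property; it is not guaranteed that its literals are typed exactly as xsd:integer on DBpedia):
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT *
WHERE {?s dbo:populationTotal ?o .
FILTER ( datatype(?o) = xsd:integer)
}
LIMIT 10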
As a side note, your query is syntactically invalid, because it is missing the namespace declaration for the xsd prefix. You can check the syntax of SPARQL queries via the SPARQLer Query Validator, and you can find the namespaces for common prefixes using Prefix.cc. The Virtuoso instance behind the DBpedia SPARQL endpoint tolerates the missing namespace for the xsd prefix, but it is good practice to stick to syntactically valid SPARQL queries for greater interoperability.
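For reference, the syntactically valid form of the original query, with the xsd prefix declared, is:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT *
WHERE {?s ?p ?o .
FILTER ( datatype(?o) = xsd:integer)
}
LIMIT 10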

Unable to use SPARQL SERVICE with FactForge

I am trying to access FactForge from a Sesame triple store. This is the query:
select *
where{
SERVICE <http://factforge.net/sparql>{
?s ?p ?o
}
}
LIMIT 100
The query doesn't get executed. The same structure works with DBpedia, and FactForge's SPARQL endpoint on the web is working. What do I need to do to access the endpoint successfully from Sesame?
What you need to do is write a more meaningful (or at least more constrained) query. Your query is just selecting all possible triples, which presumably puts a lot of stress on the factforge endpoint (which contains about 3 billion triples). The reason your query "does not get executed" (which probably means that you are just waiting forever for the query to return a result) is that it takes the SPARQL endpoint a very long time to return its response.
The LIMIT 100 you put on the query is outside the scope of the SERVICE clause, and therefore not actually communicated to the remote endpoint you're querying. While in this particular case it would be possible for Sesame's optimizer to add that (since there are no additional constraints in your query outside the scope of the SERVICE clause), unfortunately it's currently not that smart - so the query sent to factforge is without a limit, and the actual limit is only applied after you get back the result (which, in the case of your "give me all your triples" query, naturally takes a while).
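One way to make sure a limit actually reaches the remote endpoint is to push it inside the SERVICE clause as a subquery. A sketch (not tested against FactForge):
select *
where{
SERVICE <http://factforge.net/sparql>{
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100
}
}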
However, demonstrably the SERVICE clause does work for FactForge when used from Sesame, because if you try a slightly more constrained query, for example a query selecting all companies:
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select *
where{
SERVICE <http://factforge.net/sparql>{
?s a dbp-ont:Company
}
} LIMIT 100
it works fine, and you get a response.
More generally speaking, I would recommend that if your queries are aimed specifically at a particular SPARQL endpoint, you use a SPARQL endpoint proxy (which is one of the repository types available in Sesame) instead of the SERVICE clause. SERVICE is only really useful when combining data from your local repository with that of a remote endpoint in a single query. Using a SPARQL endpoint proxy ensures that LIMIT clauses are actually communicated to the endpoint, and will generally give you better performance than a SERVICE query.

SPARQL queries causing XDMP-MEMCANCELED

We are performing a series of different SPARQL queries on a database containing ~5 million triples.
Our queries often cause an XDMP-MEMCANCELED error, though not consistently; they mostly return a correct result within a few seconds or less. Some queries occasionally seem to hang and cause the server to run at 100% CPU until the query times out.
We have tried increasing what memory-related settings we could find. This query runs fine on other triple stores/engines.
We are running MarkLogic 8.0-11 on an AWS instance with 8 GB of memory and 16 GB of swap space.
Example of a fairly straightforward query that sometimes causes the error:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX t5_m: <http://url.com/T5/model#>
PREFIX t5_d: <http://url.com/T5/data#>
SELECT DISTINCT
?_app_id
( ?_err as ?_reason )
?_comment
?_severity
WHERE
{
BIND ( 3 as ?_severity)
BIND ( "Generated by HL7 v2 Conformance Profile of IHE PCD-01 message" as ?_comment )
FILTER( ?_app_id = 'APP_ID')
FILTER ( ?_ts >= '2015-04-21T09:04:07.871' )
FILTER ( ?_ts <= '2015-04-21T09:07:43.973' )
?ACK t5_m:hasMSH ?MSH .
?MSH t5_m:hasMSH.5 ?MSH_5 .
?MSH_5 t5_m:hasHD.1 ?HD_1 .
?HD_1 t5_m:hD.1Value ?_app_id .
?ACK t5_m:hasMSA ?MSA .
?MSA t5_m:hasMSA.2 ?MSA_2 .
?MSA_2 t5_m:mSA.2Value ?_msg_id .
?PCD_01_Message a t5_m:PCD_01_Message .
?PCD_01_Message t5_m:id ?_msg_id .
?PCD_01_Message t5_m:timeStamp ?_ts .
?ACK t5_m:hasERR ?ERR .
?ERR t5_m:hasERR.7 ?ERR_7 .
?ERR_7 t5_m:eRR.7Value ?_err .
}
Is there some relevant configuration settings that we have missed or is there something wrong with this query?
Yours truly, Alexander
When the total hash join table size of all running SPARQL queries exceeds 50% of the host memory, the SPARQL query using the most memory for hash joins will be canceled with the "XDMP-MEMCANCELED" error. This may indicate a number of things:
1. The host is overloaded with the number of simultaneously executing SPARQL queries. You could try adding more memory to the host (8 GB is very small) or load-balancing SPARQL queries across more MarkLogic hosts.
2. The SPARQL query optimizer is picking poor query plans for your query. For SPARQL queries with larger numbers of joins (for instance, your query has 13 joins), you could try to execute the sem:sparql() function with a higher optimization level, i.e. try adding the option "optimize=2".
3. The query optimizer may have a bug or deficiency that causes the choice of a poor query plan. You should contact MarkLogic support to look into this - they will be able to guide you to find out the query plan chosen and the optimization parameters that lead to your problem. They may be able to suggest a solution given more information, or they may file a bug report for a fix to be made.
John