Virtuoso 42000 Error The estimated execution time - sparql

Using the DBpedia-Live SPARQL endpoint http://dbpedia-live.openlinksw.com/sparql, I am trying to count the total number of triples associated with instances of type owl:Thing. As the count is really big, an exception is thrown: "Virtuoso 42000 Error The estimated execution time". To get around this, I tried using a subselect with LIMIT and OFFSET in the query. However, when the OFFSET is greater than or equal to the LIMIT, the workaround fails and the same exception (Virtuoso 42000 Error) is thrown again. Can anyone identify the problem with my query, or suggest a workaround? Here is the query I was trying:
select count(?s) as ?count
where
{
  ?s ?p ?o
  {
    select ?s
    where
    {
      ?s rdf:type owl:Thing.
    }
    limit 10000
    offset 10000
  }
}

Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.
I had to guess at your original query, since you only posted the piecemeal one you were trying to use --
select ( count(?s) as ?count )
where
{
?s rdf:type owl:Thing.
}
I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
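(If you are querying the endpoint programmatically rather than through the form, Virtuoso endpoints typically accept the same Anytime Query setting as a query-string parameter; an illustrative request URL would be http://dbpedia-live.openlinksw.com/sparql?timeout=3000000&query=... with your URL-encoded query in place of the ellipsis.)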
I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.
Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.
Also, as you are trying to count triples, you should probably use count(*) instead of count(?s). If nothing else, this helps readers of the query understand what you're doing.
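Putting those two fixes together, one page of the divide-and-conquer count might look like the following sketch (the added ORDER BY is what makes the paging deterministic; the page size and offset are illustrative):
select ( count(*) as ?count )
where
{
  ?s ?p ?o
  {
    select ?s
    where
    {
      ?s rdf:type owl:Thing.
    }
    order by ?s
    limit 10000
    offset 10000
  }
}
You would then step the OFFSET through 0, 10000, 20000, and so on, and sum the partial counts; without the ORDER BY, those partial results may overlap or miss solutions.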
Toward being able to adjust such execution time limits as your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.
Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.

Related

Wikidata Virtuoso SPARQL Endpoint - How to get more than 100,000 results

I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get for example the redirects with this query, it only returns 100,000 results at most:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
I'm writing to ask if you know of any way to get more than 100,000 results. I would like to be able to retrieve the maximum number of possible results.
Once the results are obtained, I must produce 3 files (or as few files as possible) in the N-Triples format: wikidata_instance_types.nt, wikidata_redirections.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
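For example, staying in plain SPARQL, each page of the redirects extraction might look like this sketch (the ORDER BY keeps the pages stable across requests; the 50,000 page size is just the suggestion above):
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE
{
?resource owl:sameAs ?resource2
}
ORDER BY ?resource ?resource2
LIMIT 50000
OFFSET 0
# next request: OFFSET 50000, then 100000, ... until a page comes back empty
Requesting N-Triples as the response format and concatenating the pages then yields the wikidata_redirections.nt file you're after.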
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly -- by limiting the rate of your queries, and by limiting the size of partial result sets when using ORDER BY, LIMIT, and OFFSET to step through to a full result set for a query that overflows the instance's maximum result set size -- you'll be far more successful.
You can get and host your own copy of Wikidata as explained in
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. with https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account.

GraphDB queries fail silently (OutOfMemoryError)

I'm dealing with a pretty huge repository (i.e., ~16M statements). I'm trying to get a list of all distinct class membership patterns in my graph. Here is the query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select distinct (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
}
group by (?s)
order by ?fo
Of course, such a query should always return results. Strangely enough, sometimes I get the result that I need, and sometimes the query returns no results.
Why is that happening and how can I avoid that?
PS:
Monitoring the query, I noticed that when I have no data, the query status stays stuck on:
IN_HAS_NEXT
0 operations
until the conclusion.
Update:
Here is the main log describing the problem. The output in the workbench of GraphDB is:
No results. Query took 1m 53s, minutes ago.
No mention of errors. Quite strange. As Gilles-Antoine Nys pointed out, it is a problem with memory and Java's GC. Overall, I think that the workbench should explicitly show an error message in such cases.
First, your query as written cannot tell you which subject each class-membership pattern belongs to. I recommend adding ?s to your SELECT for expressiveness, so you can easily answer your question.
Here is the simplest query for it:
select ?s (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
} group by (?s)
(ORDER BY ?fo can be added if ordering is important to you.)
Secondly, IN_HAS_NEXT is just the state of a query while it is paused in the monitor; it has nothing to do with an error.
And finally, verify your query-timeout parameter. Its default value is set to 0.
Update:
Actually, you have two errors in your log file. The problem is that your JVM is spending too long in garbage collection: the "GC overhead limit exceeded" error is thrown when the JVM spends the vast majority of its time collecting garbage while recovering almost none of the heap.
One workaround is to raise or disable the GC overhead limit (HotSpot's -XX:-UseGCOverheadLimit flag disables the check), which lets the JVM keep working through your huge dataset.
As the other comments have already suggested, the error is caused by an OutOfMemoryError (OOME). In the upcoming GraphDB 8.6 release, the developers have made a much more memory-efficient implementation of all aggregates and DISTINCT. Until its public release, there are only a few options to try:
Decrease the amount of consumed memory by writing a slightly more optimal version of your query, which displays only the local names:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?s_local (group_concat(?t_local) as ?fo)
where {
?s rdf:type ?t.
BIND (REPLACE(STR(?t), "^(.*)(/|#)([^#/]*)$", "$3") as ?t_local)
BIND (REPLACE(STR(?s), "^(.*)(/|#)([^#/]*)$", "$3") as ?s_local)
} group by (?s_local)
order by (?fo)
Increase the RAM available for performing all aggregate calculations. You can increase it by either passing a higher value for the -Xmx parameter, or by setting a minimal number for graphdb.page.cache.size in graphdb.properties, like: graphdb.page.cache.size=100M. The cache controls the number of pages stored in memory, which for your query and repository size won't make much of a difference.
Limit the maximum length of the group_concat string with -Dgraphdb.engine.function.concat.max-length=4096. The limit won't make the query execute successfully, but it will indicate whether the problem is too many subjects or too-long strings.
The problem with GraphDB Free is that heap usage grows every time the workbench is called. For example, if you run a query or just refresh the workbench, you can see the heap size increase. So even with a large amount of memory, issuing successively larger queries will eventually produce an OOM, since the heap size keeps growing. A solution might be to look for the temporary data GraphDB stores on every call, which is what drives the heap growth.

Filter by datatype

I'm trying to filter by datatype in DBpedia. For example:
SELECT *
WHERE {?s ?p ?o .
FILTER ( datatype(?o) = xsd:integer)
}
LIMIT 10
But I get no results, while there are certainly integer values. I get the same from other endpoints using Virtuoso, but I do get results from alternative endpoints. What could be the problem? If Virtuoso does not implement this SPARQL function properly, what to use instead?
This is an expensive query, since it has to traverse all triples (?s ?p ?o .). The query's execution time exceeds the maximum time configured for the Virtuoso instance that serves DBpedia's SPARQL endpoint at http://dbpedia.org/sparql.
If you don't use the timeout parameter, you will get a timeout error (Virtuoso S1T00 Error SR171: Transaction timed out). When you use the timeout (set to 30000 by default for the DBpedia endpoint), you will get incomplete results that contain HTTP headers like these:
X-SQL-State: S1TAT
X-SQL-Message: RC...: Returning incomplete results, query interrupted by result timeout. Activity: 1.389M rnd 5.146M seq 0 same seg 55.39K same pg 195 same par 0 disk 0 spec disk 0B / 0 me
Empty results may thus simply be incomplete, and do not necessarily indicate that there are no xsd:integer literals in DBpedia. A related discussion about partial results in Virtuoso can be found here.
As a solution for your query, you can load DBpedia from dumps and analyze it locally.
As a side note, your query is syntactically invalid, because it's missing the namespace for the xsd prefix. You can check the syntax of SPARQL queries via the SPARQLer Query Validator, and you can find the namespaces for common prefixes at Prefix.cc. Virtuoso, which provides the DBpedia SPARQL endpoint, tolerates the missing namespace for the xsd prefix, but it's good practice to stick to syntactically valid SPARQL queries for greater interoperability.
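For reference, a syntactically valid version of the query just declares the prefix (the namespace is the standard XML Schema one):
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT *
WHERE { ?s ?p ?o .
FILTER ( datatype(?o) = xsd:integer )
}
LIMIT 10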

Unable to use SPARQL SERVICE with FactForge

I am trying to access FactForge from Sesame triplestore. This is the query:
select *
where{
SERVICE <http://factforge.net/sparql>{
?s ?p ?o
}
}
LIMIT 100
The query doesn't get executed. The same structure works with DBpedia. FactForge's SPARQL endpoint on the web is working. What do I need to do to access the endpoint successfully from Sesame?
What you need to do is write a more meaningful (or at least more constrained) query. Your query selects all possible triples, which presumably puts a lot of stress on the FactForge endpoint (which contains about 3 billion triples). The reason your query "does not get executed" (which probably means you are just waiting forever for a result) is that it takes the SPARQL endpoint a very long time to return its response.
The LIMIT 100 you put on the query is outside the scope of the SERVICE clause, and therefore not actually communicated to the remote endpoint you're querying. While in this particular case it would be possible for Sesame's optimizer to add that (since there are no additional constraints in your query outside the scope of the SERVICE clause), unfortunately it's currently not that smart - so the query sent to factforge is without a limit, and the actual limit is only applied after you get back the result (which, in the case of your "give me all your triples" query, naturally takes a while).
However, demonstrably the SERVICE clause does work for FactForge when used from Sesame, because if you try a slightly more constrained query, for example a query selecting all companies:
PREFIX dbp-ont: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select *
where{
SERVICE <http://factforge.net/sparql>{
?s a dbp-ont:Company
}
} LIMIT 100
it works fine, and you get a response.
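And if you do want a bare triple-pattern query with the limit applied remotely, one option is to move the LIMIT into a subquery inside the SERVICE clause, so it is shipped to the endpoint along with the rest of the pattern -- a sketch:
select *
where{
SERVICE <http://factforge.net/sparql>{
select * where { ?s ?p ?o } limit 100
}
}
Here the LIMIT is part of the pattern sent to the remote endpoint, so only 100 triples are requested from FactForge.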
More generally speaking, I'd recommend that for queries aimed specifically at a particular SPARQL endpoint, you use a SPARQL endpoint proxy (one of the repository types available in Sesame) instead of the SERVICE clause. SERVICE is only really useful when combining data from your local repository with data from a remote endpoint in a single query. A SPARQL endpoint proxy lets you make sure LIMIT clauses are actually communicated to the endpoint, and will generally give you better performance than a SERVICE query.

SPARQL queries causing XDMP-MEMCANCELED

We are performing a series of different SPARQL queries on a database containing ~5 million triples.
Our queries often cause an XDMP-MEMCANCELED error, though not consistently; they mostly return a correct result within a few seconds or less. Some queries occasionally seem to hang, causing the server to run at 100% CPU until the query times out.
We have tried increasing what memory-related settings we could find. This query runs fine on other triple stores/engines.
We are running a MarkLogic 8.0-11 AWS instance with 8 GB of internal memory and 16 GB of swap space.
Example of a fairly straightforward query that sometimes causes the error:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX t5_m: <http://url.com/T5/model#>
PREFIX t5_d: <http://url.com/T5/data#>
SELECT DISTINCT
?_app_id
( ?_err as ?_reason )
?_comment
?_severity
WHERE
{
  BIND ( 3 as ?_severity )
  BIND ( "Generated by HL7 v2 Conformance Profile of IHE PCD-01 message" as ?_comment )
  FILTER ( ?_app_id = 'APP_ID' )
  FILTER ( ?_ts >= '2015-04-21T09:04:07.871' )
  FILTER ( ?_ts <= '2015-04-21T09:07:43.973' )
  ?ACK t5_m:hasMSH ?MSH .
  ?MSH t5_m:hasMSH.5 ?MSH_5 .
  ?MSH_5 t5_m:hasHD.1 ?HD_1 .
  ?HD_1 t5_m:hD.1Value ?_app_id .
  ?ACK t5_m:hasMSA ?MSA .
  ?MSA t5_m:hasMSA.2 ?MSA_2 .
  ?MSA_2 t5_m:mSA.2Value ?_msg_id .
  ?PCD_01_Message a t5_m:PCD_01_Message .
  ?PCD_01_Message t5_m:id ?_msg_id .
  ?PCD_01_Message t5_m:timeStamp ?_ts .
  ?ACK t5_m:hasERR ?ERR .
  ?ERR t5_m:hasERR.7 ?ERR_7 .
  ?ERR_7 t5_m:eRR.7Value ?_err .
}
Are there some relevant configuration settings that we have missed, or is there something wrong with this query?
Yours truly, Alexander
When the total hash join table size of all running SPARQL queries exceeds 50% of the host memory, the SPARQL query using the most memory for hash joins will be canceled with the "XDMP-MEMCANCELED" error. This may indicate a number of things:
The host is overloaded with the number of simultaneously executing SPARQL queries. You could try adding more memory to the host (8 GB is very small) or load-balancing SPARQL queries across more MarkLogic hosts.
The SPARQL query optimizer is picking poor query plans for your query. For SPARQL queries with larger numbers of joins (for instance, your query has 13 joins), you could try executing the sem:sparql() function with a higher optimization level, i.e., try adding the option "optimize=2".
The query optimizer may have a bug or deficiency that causes the choice of a poor query plan. You should contact MarkLogic support to look into this -- they will be able to guide you to find out the query plan chosen and the optimization parameters that lead to your problem. They may be able to suggest a solution given more information, or they may file a bug report for a fix to be made.
John