GraphDB queries fail silently (OutOfMemoryError) - sparql

I'm dealing with a pretty huge repository (i.e., ~16M statements). I'm trying to get a list of all distinct class membership patterns in my graph. Here is the query:
PREFIX rdf: <>
select distinct (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
} group by (?s)
order by ?fo
Such a query must of course return results. Strangely enough, sometimes I get the result that I need, and sometimes the query returns no results.
Why is that happening and how can I avoid that?
Monitoring the query, I noticed that when I get no data, the query status stays stuck on:
IN_HAS_NEXT 0 operations
until the conclusion.
Here is the main log describing the problem. The output in the workbench of GraphDB is:
No results. Query took 1m 53s, minutes ago.
No mention of errors. Quite strange. As Gilles-Antoine Nys pointed out, it is a problem with memory and Java's GC. Overall, I think that the workbench should explicitly show an error message in such cases.

First, your query cannot tell you which subject each group of types belongs to. I recommend adding ?s to your select for expressiveness, so that you can easily answer your question.
Here is the simplest query for it:
select ?s (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
} group by (?s)
(order by (?fo) can be added if it is important to you.)
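One thing to keep in mind: SPARQL does not define the order in which group_concat joins values, so the same set of types can show up as several different strings. As a rough illustration of what "distinct class membership patterns" means, here is a minimal Python sketch that does the grouping client-side and sorts each subject's types first. The function name and the sample rows are made up for the example; it assumes you can stream the (?s, ?t) rows out of the repository.

```python
from collections import defaultdict


def membership_patterns(pairs):
    """Return the sorted distinct type sets found across all subjects."""
    types_by_subject = defaultdict(set)
    for subject, rdf_type in pairs:
        types_by_subject[subject].add(rdf_type)
    # Sort each subject's types so the same set always yields the same pattern.
    patterns = {" ".join(sorted(types)) for types in types_by_subject.values()}
    return sorted(patterns)


# Hypothetical sample data: (subject, type) rows as returned by "?s rdf:type ?t".
rows = [
    ("ex:a", "ex:Person"), ("ex:a", "ex:Agent"),
    ("ex:b", "ex:Agent"), ("ex:b", "ex:Person"),
    ("ex:c", "ex:Place"),
]
print(membership_patterns(rows))
```

Here ex:a and ex:b collapse into a single pattern even though their types arrive in a different order.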
Secondly, the IN_HAS_NEXT is just a state of the query when you pause it in the monitor. Nothing to do with an error.
And finally, verify your query-timeout parameter. Its default value is set to 0.
Update :
Actually, you have two errors in your log file. The problem is that your JVM is spending too long on garbage collection while reclaiming very little memory (by default the JVM raises this error when more than 98% of its time goes to GC and less than 2% of the heap is recovered).
One solution is to raise your GC overhead limit, which allows the JVM to cope with your huge dataset.

As the other comments have already suggested, the error is caused by an OOM error. In the upcoming GraphDB 8.6 release, the developers have made a much more memory-efficient implementation of all aggregates and distinct. Until its public release, there are only a few options to test:
Decrease the amount of consumed memory by writing a slightly more optimal version of your query, which displays only the local names:
PREFIX rdf: <>
select ?s_local (group_concat(?t_local) as ?fo)
where {
?s rdf:type ?t.
BIND (REPLACE(STR(?t), "^(.*)(/|#)([^#/]*)$", "$3") as ?t_local)
BIND (REPLACE(STR(?s), "^(.*)(/|#)([^#/]*)$", "$3") as ?s_local)
} group by (?s_local)
order by (?fo)
Increase the available RAM for performing all aggregate calculations. You can increase it either by passing a higher value for the -Xmx parameter or by setting a minimal size for the cache. The cache controls the number of pages stored in memory, which for your query and repository size won't make much of a difference.
Limit the maximum length of the group concat string with -Dgraphdb.engine.function.concat.max-length=4096. The limit won't make the query execute successfully, but it will indicate whether the problem is too many subjects or overly long strings.
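To see what the REPLACE pattern in the query above does to an IRI, the same regular expression can be tried out in Python. This is only a sketch for checking the pattern; the helper name and the example IRIs are made up.

```python
import re

# Same pattern as in the SPARQL REPLACE call: keep only what follows the last "/" or "#".
LOCAL_NAME = re.compile(r"^(.*)(/|#)([^#/]*)$")


def local_name(iri):
    """Return the local name of an IRI, or the IRI unchanged if it has no '/' or '#'."""
    return LOCAL_NAME.sub(r"\3", iri)


print(local_name("http://example.org/ontology#Person"))
print(local_name("http://example.org/resource/Berlin"))
```

Note that two different IRIs with the same local name become indistinguishable after this step, so the shortened query can merge subjects or types that the original query keeps apart.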

The problem with GraphDB Free is that the heap grows every time the workbench is called. For example, if you run a query or just refresh the workbench, you can see the heap size increase. So even if you have a large amount of memory, running more and more queries in succession will eventually produce an OOM, because the heap keeps growing. The solution might be to look for the temporary data that GraphDB stores on every call, which is what increases the heap size.


Wikidata Virtuoso SPARQL Endpoint - How to get more than 100,000 results

I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across Virtuoso Wikidata endpoint.
The problem I have is that if I try to get for example the redirects with this query, it only returns 100,000 results at most:
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE {
?resource owl:sameAs ?resource2
}
I'm writing to ask if you know of any way to get more than 100,000 results. I would like to be able to achieve the maximum number of possible results.
Once the results are obtained, I must have 3 files (or as few files as possible) in the Ntriples format: wikidata_intance_types.nt, wikidata_redirecions.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly (by limiting the rate of your queries; limiting the size of partial result sets when using ORDER BY, LIMIT, and OFFSET to step through to get a full result set for a query that overflows the instance's maximum result set size; and the like), you'll be far more successful.
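The stepping-through described above could look roughly like the following Python sketch. It is not tied to any particular client library: fetch_page is a hypothetical callable that you would write yourself to run your query (with an ORDER BY) for a given LIMIT and OFFSET and return the rows. The page size and pause are just the figures suggested above.

```python
import time


def fetch_all(fetch_page, page_size=50_000, pause_seconds=1.0, sleep=time.sleep):
    """Collect every row by stepping OFFSET until a short (or empty) page comes back."""
    rows, offset = [], 0
    while True:
        # fetch_page is assumed to block until the whole page has been retrieved.
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:
            return rows
        offset += page_size
        sleep(pause_seconds)  # be polite: wait before issuing the next query


# Stand-in for a real endpoint, only so the sketch can be run locally.
data = list(range(120))


def fake_fetch(limit, offset):
    return data[offset:offset + limit]


print(len(fetch_all(fake_fetch, page_size=50, pause_seconds=0)))
```

Each page's rows can be appended to the output file as they arrive, so nothing has to be held in memory for the whole run.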
You can get and host your own copy of wikidata as explained in
There are also alternatives to get a partial dump of wikidata e.g. with
Or ask for access to one of the non public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account

Virtuoso 42000 Error The estimated execution time

Using the DBpedia-Live SPARQL endpoint, I am trying to count the total number of triples associated with the instances of type owl:Thing. As the count is really big, an exception is thrown: "Virtuoso 42000 Error The estimated execution time". To get rid of this, I tried to use a subselect, limit, and offset in the query. However, when the offset is greater than or equal to the limit, the solution doesn't work and the same exception is thrown again (Virtuoso 42000 Error). Can anyone please identify the problem with my query, or suggest a workaround? Here is the query I was trying:
select count(?s) as ?count
where {
?s ?p ?o .
{
select ?s
where { ?s rdf:type owl:Thing. }
limit 10000
offset 10000
}
}
Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.
I had to guess at your original query, since you only posted the piecemeal one you were trying to use --
select ( count(?s) as ?count )
where {
?s rdf:type owl:Thing.
}
I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.
Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.
Also, as you are trying to count triples, you should probably do a count(*) instead of count (?s). If nothing else, this helps readers of the query understand what you're doing.
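Putting those two points together, a paged version of the query might be generated like this. This is a sketch only: the helper name is made up, it assumes the rdf: and owl: prefixes are predefined on the endpoint, and there is no guarantee that the ordered subselect stays under the endpoint's estimated-time limit.

```python
def paged_count_query(limit, offset):
    """Build a query counting the triples of one ordered slice of owl:Thing instances."""
    return (
        "SELECT (COUNT(*) AS ?count)\n"
        "WHERE {\n"
        "  ?s ?p ?o .\n"
        "  {\n"
        "    SELECT ?s\n"
        "    WHERE { ?s rdf:type owl:Thing . }\n"
        "    ORDER BY ?s\n"          # makes each LIMIT/OFFSET slice predictable
        f"    LIMIT {int(limit)}\n"
        f"    OFFSET {int(offset)}\n"
        "  }\n"
        "}"
    )


print(paged_count_query(10000, 10000))
```

Summing the counts over all slices gives the total, provided the data does not change between requests (which on DBpedia-Live it may).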
Toward being able to adjust such execution time limits as your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.
Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.

ARRAY_AGG leading to OOM

I'm trying to run a pretty simple query but it's failing with a Resources exceeded error.
I read in another post that the heuristic used to allocate the number of mixers could fail from time to time.
ARRAY_AGG(response) AS responses
Is there a way to fix my query knowing that:
a response is composed of 38 fields (most of them being short strings)
the max(count()) of a single response is kind of low (165)
Query Failed
Error: Resources exceeded during query execution.
Job ID: teads-1307:bquijob_257ce97b_1566a6a3f27
It's a current limitation that arrays (produced by ARRAY_AGG or other means) must fit in the memory of a single machine. We've made a couple of recent improvements that should help to reduce the resources required for queries such as this, however. To confirm whether this is the issue, you could try a query such as:
SUM(LENGTH(FORMAT("%t", response))) AS total_response_size
ORDER BY total_response_size DESC LIMIT 1;
This formats the structs as strings as a rough heuristic of how much memory they would take to represent. If the result is very large, then perhaps we can restructure the query to use less memory. If the result is not very large, then some other issue is at play, and we'll look into getting it fixed :) Thanks!
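For intuition, the same heuristic can be mimicked outside BigQuery: format every response as a string, add up the lengths per group, and look at the largest group. The Python below is only an illustration; the grouping key and the sample rows are hypothetical, and str() on a dict stands in for FORMAT("%t", ...).

```python
from collections import defaultdict


def largest_group_size(rows):
    """Return (key, total_chars) for the group whose formatted responses are longest."""
    totals = defaultdict(int)
    for key, response in rows:
        # Length of the string form is a rough proxy for the memory the struct needs.
        totals[key] += len(str(response))
    return max(totals.items(), key=lambda item: item[1])


# Hypothetical (group key, response) rows.
rows = [
    ("u1", {"q": "a", "v": "yes"}),
    ("u1", {"q": "b", "v": "no"}),
    ("u2", {"q": "a", "v": "maybe"}),
]
print(largest_group_size(rows))
```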


We are performing a series of different SPARQL queries on a database containing ~5 million triples.
Our queries often cause an XDMP-MEMCANCELED error, though not consistently; mostly they return a correct result within a few seconds or less. Some queries occasionally seem to hang and cause the server to run at 100% CPU until the query times out.
We have tried increasing every memory-related setting we could find. This query runs fine on other triple stores/engines.
We are running a MarkLogic 8.0-11 AWS instance with 8 GB of internal memory and 16 GB of swap space.
Example of a fairly straightforward query that sometimes causes the error:
PREFIX xsd: <>
PREFIX t5_m: <>
PREFIX t5_d: <>
( ?_err as ?_reason )
BIND ( 3 as ?_severity)
BIND ( "Generated by HL7 v2 Conformance Profile of IHE PCD-01 message" as ?_comment )
FILTER( ?_app_id = 'APP_ID')
FILTER ( ?_ts >= '2015-04-21T09:04:07.871' )
FILTER ( ?_ts <= '2015-04-21T09:07:43.973' )
?ACK t5_m:hasMSH ?MSH .
?MSH t5_m:hasMSH.5 ?MSH_5 .
?MSH_5 t5_m:hasHD.1 ?HD_1 .
?HD_1 t5_m:hD.1Value ?_app_id .
?ACK t5_m:hasMSA ?MSA .
?MSA t5_m:hasMSA.2 ?MSA_2 .
?MSA_2 t5_m:mSA.2Value ?_msg_id .
?PCD_01_Message a t5_m:PCD_01_Message .
?PCD_01_Message t5_m:id ?_msg_id .
?PCD_01_Message t5_m:timeStamp ?_ts .
?ACK t5_m:hasERR ?ERR .
?ERR t5_m:hasERR.7 ?ERR_7 .
?ERR_7 t5_m:eRR.7Value ?_err .
Are there any relevant configuration settings that we have missed, or is there something wrong with this query?
Yours truly, Alexander
When the total hash join table size of all running SPARQL queries exceeds 50% of the host memory, the SPARQL query using the most memory for hash joins will be canceled with the "XDMP-MEMCANCELED" error. This may indicate a number of things:
The host is overloaded with the number of simultaneously executing SPARQL queries. You could try adding more memory to the host (8 GB is very small) or load-balancing SPARQL queries across more MarkLogic hosts.
The SPARQL query optimizer is picking poor query plans for your query. For SPARQL queries with larger numbers of joins (for instance, your query has 13 joins), you could try to execute the sem:sparql() function with a higher optimization level, i.e., try adding the option
The query optimizer may have a bug or deficiency that causes the choice of a poor query plan. You should contact MarkLogic support to look into this. They will be able to guide you to find out the query plan chosen and the optimization parameters that lead to your problem. They may be able to suggest a solution given more information, or they may file a bug report for a fix to be made.
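To make the 50% rule concrete, here is a toy Python model of it. It only restates the rule as described above (it is not MarkLogic code), and the query names and sizes are invented.

```python
def query_to_cancel(hash_join_bytes, host_memory_bytes):
    """Return the id of the query that would be canceled, or None if under the threshold."""
    if not hash_join_bytes:
        return None
    # Nothing is canceled while the combined hash join tables fit in half the host memory.
    if sum(hash_join_bytes.values()) <= 0.5 * host_memory_bytes:
        return None
    # Otherwise the query using the most hash join memory is the one canceled.
    return max(hash_join_bytes, key=hash_join_bytes.get)


GB = 1024 ** 3
running = {"q1": 1 * GB, "q2": 3 * GB, "q3": GB // 2}  # hypothetical concurrent queries
print(query_to_cancel(running, 8 * GB))
```

On an 8 GB host the threshold is 4 GB in total, so a query can be canceled because of what else is running at the time, which would fit with the error appearing only intermittently.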

How to resolve the execution limits in Linkedmdb

I was trying to extract all movies from Linkedmdb. I used OFFSET to make sure I won't hit the maximum number of results per query. I used the following script in Python:
PREFIX rdfs: <>
PREFIX movie: <>
SELECT distinct ?film
WHERE {
?film a movie:film .
} LIMIT 1000 OFFSET %s """ %i
I looped 5 times, with offsets being 0,1000,2000,3000,4000 and recorded the number of results. It was (1000,1000,500,0,0). I already knew the limit was 2500 but I thought by using OFFSET, we can get away with this.
Is that not true? Is there no way to get all the data (even when we use a loop of some sort)?
Your current query is legal, but there's no specified ordering, so the offset doesn't bring you to a predictable place in the results. (A lazy implementation could just return the same results over and over again.) When you use limit and offset, you need to also use order by. The SPARQL 1.1 specification says (emphasis added):
OFFSET causes the solutions generated to start after the specified
number of solutions. An OFFSET of zero has no effect.
Using LIMIT and OFFSET to select different subsets of the query
solutions will not be useful unless the order is made predictable by
using ORDER BY.
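A small simulation shows why. The Python below pages through an in-memory list that stands in for the endpoint; without ordering, the list is shuffled on every request, which is a deliberately bad case (a real store is not obliged to behave this way, but nothing forbids it either). All names here are made up for the demonstration.

```python
import random


def page_through(rows, page_size, ordered, seed=0):
    """Collect pages from a store that may reorder its rows on every request."""
    rng = random.Random(seed)
    collected = []
    for offset in range(0, len(rows), page_size):
        snapshot = list(rows)
        if ordered:
            snapshot.sort()        # ORDER BY: same order on every request
        else:
            rng.shuffle(snapshot)  # no ORDER BY: order may differ each time
        collected.extend(snapshot[offset:offset + page_size])
    return collected


films = [f"film{i:04d}" for i in range(2500)]
with_order = page_through(films, 1000, ordered=True)
without_order = page_through(films, 1000, ordered=False)
print(len(set(with_order)), len(set(without_order)))
```

With ordering, the three pages cover all 2500 films exactly once. Without it, the same number of rows comes back, but some films repeat and others never appear.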