I was trying to extract all movies from LinkedMDB. I used OFFSET to make sure I wouldn't hit the maximum number of results per query. I used the following script in Python:
"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
SELECT distinct ?film
WHERE {
?film a movie:film .
} LIMIT 1000 OFFSET %s """ %i
I looped 5 times, with offsets of 0, 1000, 2000, 3000, and 4000, and recorded the number of results. The counts were (1000, 1000, 500, 0, 0). I already knew the limit was 2500, but I thought that by using OFFSET we could get around it.
Is that not true? Is there really no way to get all the data (even with a loop of some sort)?
Your current query is legal, but there's no specified ordering, so the offset doesn't bring you to a predictable place in the results. (A lazy implementation could just return the same results over and over again.) When you use LIMIT and OFFSET, you also need to use ORDER BY. The SPARQL 1.1 specification says (emphasis added):
15.4 OFFSET
OFFSET causes the solutions generated to start after the specified
number of solutions. An OFFSET of zero has no effect.
Using LIMIT and OFFSET to select different subsets of the query
solutions will not be useful unless the order is made predictable by
using ORDER BY.
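A minimal sketch of the corrected approach, assuming the endpoint honors ORDER BY on this pattern: sort on a stable key (the ?film IRI itself works), then page through by increasing OFFSET by the LIMIT on each request.
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
SELECT DISTINCT ?film
WHERE {
?film a movie:film .
}
ORDER BY ?film     # stable ordering makes the OFFSET position predictable
LIMIT 1000
OFFSET 0           # increase by 1000 on each successive request
In the Python loop, the OFFSET value would then step through 0, 1000, 2000, ... exactly as before.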
Using the DBpedia-Live SPARQL endpoint (http://dbpedia-live.openlinksw.com/sparql), I am trying to count the total number of triples associated with instances of type owl:Thing. As the count is really big, an exception is thrown: "Virtuoso 42000 Error: The estimated execution time ...". To get rid of this I tried to use a subselect with LIMIT and OFFSET in the query. However, when the offset is greater than or equal to the limit, this no longer works and the same exception (Virtuoso 42000 Error) is thrown again. Can anyone identify the problem with my query, or suggest a workaround? Here is the query I was trying:
select count(?s) as ?count
where
{
?s ?p ?o
{
select ?s
where
{
?s rdf:type owl:Thing.
}
limit 10000
offset 10000
}
}
Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.
I had to guess at your original query, since you only posted the piecemeal one you were trying to use --
select ( count(?s) as ?count )
where
{
?s rdf:type owl:Thing.
}
I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.
Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.
Also, as you are trying to count triples, you should probably do a count(*) instead of count (?s). If nothing else, this helps readers of the query understand what you're doing.
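If you do want to keep the divide-and-conquer shape, here is a minimal sketch of one chunk, assuming the endpoint accepts ORDER BY inside the subselect; the explicit ordering is what keeps successive OFFSET chunks from overlapping, and count(*) counts the triples in each chunk:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
select ( count(*) as ?count )
where
{
?s ?p ?o
{
select ?s
where
{
?s rdf:type owl:Thing .
}
order by ?s      # predictable order, so chunks neither repeat nor skip subjects
limit 10000
offset 10000     # advance by 10000 for each successive chunk
}
}
You would then sum the partial counts over all chunks.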
Toward being able to adjust such execution time limits as your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.
Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.
I'm dealing with a pretty huge repository (~16M statements). I'm trying to get a list of all distinct class membership patterns in my graph. Here is the query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select distinct (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
}
group by (?s)
order by ?fo
Of course, such a query should always return results. Strangely enough, sometimes I get the results I need, and sometimes the query returns no results.
Why is this happening, and how can I avoid it?
PS:
Monitoring the query, I noticed that when no results come back, the query status stays stuck on:
IN_HAS_NEXT
0 operations
until the conclusion.
Update:
Here is the main log describing the problem. The output in the workbench of GraphDB is:
No results. Query took 1m 53s, minutes ago.
No mention of errors. Quite strange. As Gilles-Antoine Nys pointed out, it is a problem with memory and Java's GC. Overall, I think that the workbench should explicitly show an error message in such cases.
First, your query cannot tell you which subject each class-membership pattern belongs to. I recommend adding ?s to your SELECT for expressiveness, so that you can easily answer your question.
Here is the simplest query for it:
select ?s (group_concat(?t) as ?fo)
where {
?s rdf:type ?t.
} group by (?s)
(order by ?fo can be added if ordering is important to you).
Secondly, IN_HAS_NEXT is just the state of a query when you pause it in the monitor; it has nothing to do with an error.
And finally, verify your query-timeout parameter. Its default value is set to 0.
Update:
Actually you have two errors in your log file. The problem is that your JVM is spending too much time in garbage collection (over 98% of its time, while recovering almost no memory).
One solution is to raise the GC overhead limit, and therefore allow the JVM to deal with your huge dataset.
As the other comments have already suggested, the error is caused by an OutOfMemoryError. In the upcoming GraphDB 8.6 release, the developers have made a much more memory-efficient implementation of all aggregates and DISTINCT. Until its public release, there are only a few options to try:
Decrease the amount of memory consumed by writing a slightly more optimal version of your query, which displays only the local names:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
select ?s_local (group_concat(?t_local) as ?fo)
where {
?s rdf:type ?t.
BIND (REPLACE(STR(?t), "^(.*)(/|#)([^#/]*)$", "$3") as ?t_local)
BIND (REPLACE(STR(?s), "^(.*)(/|#)([^#/]*)$", "$3") as ?s_local)
} group by (?s_local)
order by (?fo)
Increase the available RAM for performing all aggregate calculations. You can increase it by either passing a higher value for the -Xmx parameter or setting a minimal value for graphdb.page.cache.size in graphdb.properties, e.g. graphdb.page.cache.size=100M. The cache controls the number of pages kept in memory, which for your query and repository size won't make much of a difference.
Limit the maximum length of the GROUP_CONCAT string with -Dgraphdb.engine.function.concat.max-length=4096. The limit won't make the query execute successfully, but it will indicate whether the problem is too many subjects or too-long strings.
The problem with GraphDB Free is that heap usage grows every time the workbench is used. For example, if you run a query or simply refresh the workbench, you can see the heap size increase. So even if you have a large amount of memory, running successively more queries will eventually produce an OOM, because heap usage keeps growing. The solution might be to look for the temporary data GraphDB stores on every call, which is what drives the heap growth.
Is it possible to generate a random sample of triples using SPARQL?
I thought it might be possible via the SAMPLE function, but that returns a single sample.
My workaround would be to generate a random number to use with the OFFSET keyword and use the LIMIT keyword to return the desired sample size. I'll just hardcode the offset to 200 here for simplicity, like so:
SELECT *
WHERE {
?s ?p ?o
}
OFFSET 200 #random number variable
LIMIT 100
Any better suggestions to generate a random sample of 100 data triples from a SPARQL endpoint?
In SPARQL 1.1 you can try to use
...} ORDER BY RAND() LIMIT 100
But whether this works might depend on the triple store.
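Applied to the ?s ?p ?o pattern from the question, the whole query would look something like the sketch below, assuming the store actually evaluates RAND() inside ORDER BY:
SELECT ?s ?p ?o
WHERE {
?s ?p ?o
}
ORDER BY RAND()    # re-shuffles the solutions on every execution
LIMIT 100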
The accepted answer may work but is not optimal as the suggested approach may lead to errors on some triple stores (e.g. on Apache Jena).
As pointed out in the ticket and by @Joshua-Taylor in the comment section above, a better answer is:
SELECT ... WHERE
{
...
BIND(RAND() AS ?sortKey)
} ORDER BY ?sortKey ...
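Spelled out against the same ?s ?p ?o pattern, a complete version of that approach might look like this (the ?sortKey name is just illustrative):
SELECT ?s ?p ?o
WHERE {
?s ?p ?o .
BIND(RAND() AS ?sortKey)    # attach one random key to each solution
}
ORDER BY ?sortKey
LIMIT 100
Binding the random value to a variable and sorting on it avoids the per-store quirks of calling RAND() directly in the ORDER BY clause.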
We are performing a series of different SPARQL queries on a database containing ~5 million triples.
Our queries often cause an XDMP-MEMCANCELED error, though not consistently; they mostly return a correct result within a few seconds or less. Some queries occasionally seem to hang and cause the server to run at 100% CPU until the query times out.
We have tried increasing what memory-related settings we could find. This query runs fine on other triple stores/engines.
We are running a MarkLogic 8.0-11 AWS instance with 8 GB of internal memory and 16 GB of swap space.
Example of a fairly straightforward query that sometimes causes the error:
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX t5_m: <http://url.com/T5/model#>
PREFIX t5_d: <http://url.com/T5/data#>
SELECT DISTINCT
?_app_id
( ?_err as ?_reason )
?_comment
?_severity
WHERE
{
BIND ( 3 as ?_severity)
BIND ( "Generated by HL7 v2 Conformance Profile of IHE PCD-01 message" as ?_comment )
FILTER( ?_app_id = 'APP_ID')
FILTER ( ?_ts >= '2015-04-21T09:04:07.871' )
FILTER ( ?_ts <= '2015-04-21T09:07:43.973' )
?ACK t5_m:hasMSH ?MSH .
?MSH t5_m:hasMSH.5 ?MSH_5 .
?MSH_5 t5_m:hasHD.1 ?HD_1 .
?HD_1 t5_m:hD.1Value ?_app_id .
?ACK t5_m:hasMSA ?MSA .
?MSA t5_m:hasMSA.2 ?MSA_2 .
?MSA_2 t5_m:mSA.2Value ?_msg_id .
?PCD_01_Message a t5_m:PCD_01_Message .
?PCD_01_Message t5_m:id ?_msg_id .
?PCD_01_Message t5_m:timeStamp ?_ts .
?ACK t5_m:hasERR ?ERR .
?ERR t5_m:hasERR.7 ?ERR_7 .
?ERR_7 t5_m:eRR.7Value ?_err .
}
Is there some relevant configuration settings that we have missed or is there something wrong with this query?
Yours truly, Alexander
When the total hash join table size of all running SPARQL queries exceeds 50% of the host memory, the SPARQL query using the most memory for hash joins will be canceled with the "XDMP-MEMCANCELED" error. This may indicate a number of things:
The host is overloaded with the number of simultaneously executing SPARQL queries. You could try adding more memory to the host (8 GB is very small) or load-balancing SPARQL queries across more MarkLogic hosts.
The SPARQL query optimizer is picking poor query plans for your query. For SPARQL queries with larger numbers of joins (for instance, your query has 13 joins), you could try executing the sem:sparql() function with a higher optimization level, i.e. try adding the option "optimize=2".
The query optimizer may have a bug or deficiency that causes the choice of a poor query plan. You should contact MarkLogic support to look into this - they will be able to guide you to find out the query plan chosen and the optimization parameters that lead to your problem. They may be able to suggest a solution given more information, or they may file a bug report for a fix to be made.
John
Given the expression
{WorkProduct: {$in:[0001,0002,0003,...]}}
Is there a limit to the number of items I can query against?
Our API enforces no limit, but the overall query we send to our underlying engine must be smaller than 4 MB. Even allowing for overhead, theoretically you could put a couple hundred thousand entries in an $in clause before hitting that.
So, that's the theory and the absolute limit. I suspect you'll hit a practical limit long before then, because $in clauses perform on the order of m * log(n), where n is the number of elements in the collection and m is the number of elements in your $in clause. So there will be a linear slowdown with the number of elements in the $in clause.