Is it possible to generate a random sample of triples using SPARQL?
I thought it might be possible via the SAMPLE aggregate, but that returns only a single sample.
My workaround would be to generate a random number to use with the OFFSET keyword, and use the LIMIT keyword to return the desired sample size. For simplicity I'll hardcode the random offset to 200, like so:
SELECT *
WHERE {
?s ?p ?o
}
OFFSET 200 #random number variable
LIMIT 100
Any better suggestions to generate a random sample of 100 data triples from a SPARQL endpoint?
In SPARQL 1.1 you can try to use
...} ORDER BY RAND() LIMIT 100
But whether this works might depend on the triple store.
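For example, applied to the query from the question, the whole thing might look like this (a sketch; as noted, whether RAND() in an ORDER BY clause actually shuffles the solutions depends on the triple store):

SELECT *
WHERE {
  ?s ?p ?o
}
ORDER BY RAND()   # evaluated per solution on some stores, rejected or optimised away on others
LIMIT 100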
The accepted answer may work, but it is not optimal: the suggested approach can lead to errors on some triple stores (e.g. on Apache Jena).
As pointed out in the ticket and by @Joshua-Taylor in the comments above, a better approach is:
SELECT ... WHERE
{
...
BIND(RAND() AS ?sortKey)
} ORDER BY ?sortKey ...
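Applied to the pattern from the original question, a complete version of that sketch could look like the following (the LIMIT of 100 matches the sample size asked for above):

SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o .
  BIND(RAND() AS ?sortKey)   # attach a random sort key to every solution
}
ORDER BY ?sortKey
LIMIT 100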
Related
I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across the Virtuoso Wikidata endpoint.
The problem I have is that if I try to get for example the redirects with this query, it only returns 100,000 results at most:
PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
I'm writing to ask if you know of any way to get more than 100,000 results. I would like to be able to retrieve as many results as possible.
Once the results are obtained, I need 3 files (or as few files as possible) in N-Triples format: wikidata_instance_types.nt, wikidata_redirections.nt and wikidata_disambiguations.nt.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly (by limiting the rate of your queries, limiting the size of partial result sets when using ORDER BY, LIMIT, and OFFSET to step through to a full result set for a query that overflows the instance's maximum result set size, and the like) you'll be far more successful.
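As a rough sketch of that advice applied to the redirects query above (the 50,000 chunk size comes from the previous paragraph; ORDER BY ?resource is an assumed sort key added so that successive pages neither overlap nor skip solutions):

PREFIX owl: <http://www.w3.org/2002/07/owl#>
CONSTRUCT { ?resource owl:sameAs ?resource2 }
WHERE
{
  ?resource owl:sameAs ?resource2
}
ORDER BY ?resource
LIMIT 50000
OFFSET 0          # then 50000, 100000, ... on each subsequent query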
You can get and host your own copy of wikidata as explained in
https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData
There are also alternatives for getting a partial dump of Wikidata, e.g. with https://github.com/bennofs/wdumper
Or ask for access to one of the non-public copies we run by sending me a personal e-mail via my RWTH Aachen i5 account.
I'm working on a project that displays celebrities' heights, and Wikidata does not always provide the information needed.
So I considered DBpedia as another option and spent days trying to work out how to deal with it.
How can I fetch every instance of the Person class, together with its height, from DBpedia? Is that possible?
You can use SPARQL to export the data from DBpedia (http://live.dbpedia.org/sparql/).
PREFIX dbo: <http://dbpedia.org/ontology/>
select * where {
  ?person dbo:Person\/height ?height .   # note the escaped '/' in the prefixed name
}
LIMIT 10000 OFFSET 10000
The public endpoint limits the size of result sets, currently to 10,000 rows, so you'll have to step through the data with an increasing OFFSET, or find a better-suited endpoint for your needs.
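For instance, a paged version of the query above might look like the following sketch; the ORDER BY clause is an addition to keep the pages predictable, and OFFSET is raised by 10,000 on each request:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?height
WHERE {
  ?person dbo:Person\/height ?height .
}
ORDER BY ?person
LIMIT 10000
OFFSET 20000      # 0, 10000, 20000, ... per page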
Using the DBpedia-Live SPARQL endpoint http://dbpedia-live.openlinksw.com/sparql, I am trying to count the total number of triples associated with the instances of type owl:Thing. As the count is really big, an exception is thrown: "Virtuoso 42000 Error The estimated execution time". To get around this I tried to use a subselect with LIMIT and OFFSET in the query. However, when the offset is greater than or equal to the limit, this doesn't work and the same exception (Virtuoso 42000 Error) is thrown again. Can anyone identify the problem with my query, or suggest a workaround? Here is the query I was trying:
select count(?s) as ?count
where
{
  ?s ?p ?o
  {
    select ?s
    where
    {
      ?s rdf:type owl:Thing.
    }
    limit 10000
    offset 10000
  }
}
Your solution starts with patience. Virtuoso's Anytime Query feature returns some results when a timeout strikes, and keeps running the query in the background -- so if you come back later, you'll typically get more solutions, up to the complete result set.
I had to guess at your original query, since you only posted the piecemeal one you were trying to use --
select ( count(?s) as ?count )
where
{
?s rdf:type owl:Thing.
}
I got 3,923,114 within a few seconds, without hitting any timeout. I had set a timeout of 3000000 milliseconds (= 3000 seconds = 50 minutes) on the form -- in contrast to the endpoint's default timeout of 30000 milliseconds (= 30 seconds) -- but clearly hit neither of these, nor the endpoint's server-side configured timeout.
I think you already understand this, but please do note that this count is a moving target, and will change regularly as the DBpedia-Live content continues to be updated from the Wikipedia firehose.
Your divide-and-conquer effort has a significant issue. Note that without an ORDER BY clause in combination with your LIMIT/OFFSET clauses, you may find that some solutions (in this case, some values of ?s) repeat and/or some solutions never appear in a final aggregation that combines all those partial results.
Also, as you are trying to count triples, you should probably do a count(*) instead of count (?s). If nothing else, this helps readers of the query understand what you're doing.
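Putting those two points together, a corrected version of your chunked query might look like this sketch (the PREFIX declarations are added for portability; ORDER BY ?s inside the subselect is what makes the OFFSET paging predictable):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ( count(*) AS ?count )
WHERE
{
  ?s ?p ?o .
  {
    SELECT ?s
    WHERE
    {
      ?s rdf:type owl:Thing .
    }
    ORDER BY ?s
    LIMIT 10000
    OFFSET 10000
  }
}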
As for being able to adjust the execution-time limits your query is hitting -- the easiest way would be to instantiate your own mirror via the DBpedia-Live AMI; unfortunately, this is not currently available for new customers, for a number of reasons. (Existing customers may continue to use their AMIs.) We will likely revive this at some point, but the timing is indefinite; you could open a Support Case to register your interest, and be notified when the AMI is made available for new users.
Toward an ultimate solution... There may be better ways to get to your actual end goal than those you're currently working on. You might consider asking on the DBpedia mailing list or the OpenLink Community Forum.
I was trying to extract all movies from LinkedMDB. I used OFFSET to make sure I wouldn't hit the maximum number of results per query. I used the following script in Python:
"""
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
SELECT distinct ?film
WHERE {
?film a movie:film .
} LIMIT 1000 OFFSET %s """ %i
I looped 5 times, with offsets of 0, 1000, 2000, 3000 and 4000, and recorded the number of results: (1000, 1000, 500, 0, 0). I already knew the limit was 2500, but I thought that by using OFFSET we could get around it.
Is that not true? Is there no way to get all the data (even with a loop of some sort)?
Your current query is legal, but there's no specified ordering, so the offset doesn't bring you to a predictable place in the results. (A lazy implementation could just return the same results over and over again.) When you use LIMIT and OFFSET, you also need to use ORDER BY. The SPARQL 1.1 specification says (emphasis added):
15.4 OFFSET
OFFSET causes the solutions generated to start after the specified number of solutions. An OFFSET of zero has no effect.
Using LIMIT and OFFSET to select different subsets of the query solutions will not be useful unless the order is made predictable by using ORDER BY.
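So a corrected version of the query in your script would simply add an ORDER BY clause, for example (a sketch; ?film is the natural sort key here):

PREFIX movie: <http://data.linkedmdb.org/resource/movie/>
SELECT DISTINCT ?film
WHERE {
  ?film a movie:film .
}
ORDER BY ?film    # makes each LIMIT/OFFSET page predictable
LIMIT 1000
OFFSET 2000       # step by 1000 on each iteration of the loop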
I've created a search API for a site that I work on. For example, some of the queries it supports are:
/api/search - returns popular search results
/api/search?q=car - returns results matching the term "car"
/api/search?start=50&limit=50 - returns 50 results starting at offset 50
/api/search?user_id=3987 - returns results owned by the user with ID 3987
These query arguments can be mixed and matched. It's implemented under the hood using Solr's faceted search.
I'm working on adding query arguments that can filter results based on a numeric attribute. For example, I might want to only return results where the view count is greater than 100. I'm wondering what the best practice is for specifying this.
Solr uses this way:
/api/search?views:[100 TO *]
Google seems to do something like this:
/api/search?viewsisgt:100
Neither of these seem very appealing to me. Is there a best practice for specifying this kind of query term? Any suggestions?
Simply use ',' as the separator for from/to; it reads best and, in my view, is intuitive:
# fixed from/to
/search?views=4,2
# upper wildcard
/search?views=4,
# lower wildcard
/search?views=,4
I'd treat the values as inclusive. In most circumstances you won't need additional syntactic sugar for inclusive/exclusive bounds.
Binding even works well out of the box in some frameworks (like Spring MVC), which bind comma-separated values to an array of values. You could then wrap the internal array with specific accessors (getMin(), getMax()).
Google's approach is good; why is it not appealing?
Here comes my suggestion:
/api/search?viewsgt=100
I think the mathematical notation for limits is suitable:
[x  the lower limit can be at least x
x]  the upper limit can be at most x
(x  the lower limit must be strictly greater than x
x)  the upper limit must be strictly less than x
Hence,
q=cats&range=(100,200) - results from 100 to 200, excluding both 100 and 200
q=cats&range=[100,200) - results from 100 to 200, including 100 but excluding 200
q=cats&range=[100 - any number from 100 onwards
q=cats&range=(100 - any number strictly greater than 100
q=cats&range=100,200 - default, same as [100,200]
Sure, its aesthetics are still questionable, but it seems (IMO) the most intuitive for the human eye, and the parser is still easy.
As per http://en.wikipedia.org/wiki/Percent-encoding, the characters =, &, [, ], ( and ) are reserved.