SPARQL DISTINCT gives duplicates in Virtuoso

The following SPARQL query returns duplicates in Virtuoso even though it uses the DISTINCT clause. You can test the query against the DBpedia public endpoint. What is the problem with the query?
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia:<http://dbpedia.org/resource/>
PREFIX dbpedia-owl:<http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX vrank:<http://purl.org/voc/vrank#>
SELECT DISTINCT ?person1 ?person1_id ?person2 ?person2_id ?person2_rank
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
?person1 rdf:type dbpedia-owl:Person.
?person2 rdf:type dbpedia-owl:Person.
?person1 ?link ?person2.
?person1 dbpedia-owl:wikiPageID ?person1_id.
?person2 dbpedia-owl:wikiPageID ?person2_id.
?person2 vrank:hasRank/vrank:rankValue ?person2_rank.
FILTER (?person1_id != ?person2_id).
FILTER (?person1_id = 308)
} ORDER BY DESC(?person2_rank) ASC(?person2_id)
The results include rows that appear to be duplicates, e.g.:
http://dbpedia.org/resource/Aristotle 308 http://dbpedia.org/resource/Democritus 8211 27.281
http://dbpedia.org/resource/Aristotle 308 http://dbpedia.org/resource/Democritus 8211 27.281
http://dbpedia.org/resource/Aristotle 308 http://dbpedia.org/resource/Heraclitus 13792 26.6914
http://dbpedia.org/resource/Aristotle 308 http://dbpedia.org/resource/Heraclitus 13792 26.6914
http://dbpedia.org/resource/Aristotle 308 http://dbpedia.org/resource/Parmenides 23575 19.6082
http://dbpedia.org/resource/Aristotle 308 http://dbpedia.org/resource/Parmenides 23575 19.6082

I can confirm that the results appear to contain duplicates. I'm not absolutely sure what causes them, but I wonder if it might have something to do with inexact equality for floating-point numbers. If, instead of selecting the floating-point numbers directly, you select their lexical forms (note the (str(?person2_rank) as ?rank) at the end):
SELECT DISTINCT
?person1 ?person1_id
?person2 ?person2_id
(str(?person2_rank) as ?rank)
I get none of the duplicates. This might be worth reporting to the Virtuoso folks as a bug. For what it's worth, if you want floating point values for rank, you can use xsd:float as a function to turn that string back into a floating point value, and when I do that, with the select like the following, I still get the expected distinct results.
SELECT DISTINCT
?person1 ?person1_id
?person2 ?person2_id
(xsd:float(str(?person2_rank)) as ?rank)
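The floating-point hypothesis above can be illustrated outside of SPARQL. The following Python sketch is only an analogy, under the assumption that two stored rank values differ slightly in their binary representation while sharing the same printed form. It is not a confirmed description of what Virtuoso does internally, and the helper to_float32 and the sample rows are made up for illustration.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Hypothetical sample rows: the same rank held once with double precision
# and once with single precision.
rows = [
    ("Democritus", 27.281),
    ("Democritus", to_float32(27.281)),
]

# Deduplicating on the raw values keeps both rows, because the two floats
# are not exactly equal.
distinct_raw = set(rows)

# Deduplicating on a short lexical form, similar in spirit to
# str(?person2_rank), merges them.
distinct_lexical = {(name, f"{rank:.6g}") for name, rank in rows}

print(len(distinct_raw), len(distinct_lexical))
```

If something like this is happening, comparing lexical forms would hide the difference, which is consistent with the str(...) workaround removing the duplicates.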

Although this won't help with your DBpedia query, anyone who arrived here via a search on the title and who has control of the model and data may want to know that:
In Virtuoso, double does not seem to suffer from the SELECT DISTINCT problem that occurs with float.

Related

Dbpedia sparql gives empty result to my query, what is wrong?

I'm running a SPARQL query on this site.
It gives me an empty result. What is wrong with my query?
prefix foaf: <http://xmlns.com/foaf/0.1/>
select * where {
?s rdf:type foaf:Person.
} LIMIT 100
This query is OK, but when I add the following second pattern, I get an empty result.
?s foaf:name 'Abraham_Robinson'.
prefix foaf: <http://xmlns.com/foaf/0.1/>
select * where {
?s rdf:type foaf:Person.
?s foaf:name 'Abraham_Robinson'.
} LIMIT 100
How can I correct my query so that the result includes this record:
http://dbpedia.org/resource/Abraham_Robinson
I guess Kingsley misread this as a freetext search, rather than a simple string comparison.
The query Kingsley posted and linked earlier delivers no solution, for the same reasons as the original query failed, as identified in the comment by AKSW, i.e. --
foaf:name values don't generally have their spaces replaced with underscores, as the value in the original query did; i.e., 'Abraham_Robinson' should have been 'Abraham Robinson'
foaf:name strings are typically langtagged, as this one is, so the literal actually needs to be 'Abraham Robinson'@en
With AKSW's fixes incorporated, i.e., using the line ?s foaf:name 'Abraham Robinson'@en. in the pattern, the query works.
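Why the original pattern matches nothing can be sketched in a few lines of Python. This is a simplified model, not how any triple store is implemented: an RDF literal is treated as a pair of lexical form and optional language tag, and an exact-match triple pattern only matches when both parts are equal. The Literal class and the stored value are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Simplified RDF literal: lexical form plus optional language tag."""
    value: str
    lang: Optional[str] = None


# Assumed value stored for foaf:name, following the answer above.
stored = Literal("Abraham Robinson", "en")

attempts = [
    Literal("Abraham_Robinson"),        # underscore, no language tag
    Literal("Abraham Robinson"),        # right text, missing language tag
    Literal("Abraham Robinson", "en"),  # same text and same language tag
]

# Exact matching compares both the text and the language tag.
matches = [candidate == stored for candidate in attempts]
print(matches)
```

Only the last attempt equals the stored literal, which mirrors the two fixes listed above.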
All that said -- you may prefer an alternative query, which will deliver results whether or not the foaf:name value is langtagged and whether or not the spaces are replaced by underscores. The first one is Virtuoso-specific, and produces results faster because the bif:contains function uses Virtuoso's free-text indexes --
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE
{
?s rdf:type foaf:Person ;
foaf:name ?name .
?name bif:contains "'Abraham Robinson'" .
}
LIMIT 100
Generic SPARQL using a REGEX FILTER works against both Virtuoso and other RDF stores, but produces results more slowly because REGEX does not leverage the free-text indexes, as in --
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT *
WHERE
{
?s rdf:type foaf:Person ;
foaf:name ?name .
FILTER ( REGEX ( ?name, 'Abraham Robinson' ) ) .
}
LIMIT 100
The following query should work. Right now it isn't working due to a need to refresh the text index associated with this instance (currently in progress):
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * where {
?s rdf:type foaf:Person.
?s foaf:name "'Abraham Robinson'".
}
LIMIT 100
Note how the phrase is placed within single-quotes that are within double-quotes.
If the literal content is language-tagged, as is the case in DBpedia, the exact-match query would take the form (already clarified in @TallTed's response):
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT * where {
?s rdf:type foaf:Person.
?s foaf:name 'Abraham Robinson'@en.
}
LIMIT 100
This SPARQL Results Page Link should produce a solution when the index update completes.

SPARQL query cannot find page that should exist

I am writing a query in DBpedia live http://dbpedia-live.openlinksw.com/sparql/ to return details of well known people. As a test case, I know there exists a page for the Roman Emperor Nero at http://dbpedia.org/page/Nero but my query does not return a row for him. My SPARQL query is:
# Runs in live (http://dbpedia-live.openlinksw.com/sparql/)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?x0 ?name2 ?dob
WHERE {
?x0 rdf:type foaf:Person.
?x0 rdfs:label ?name.
?x0 dbo:birthDate ?dob.
FILTER REGEX(?name,"^Ner.*","i").
BIND (str(?name) AS ?name2)
} ORDER BY ?name2 LIMIT 100
I checked the page http://dbpedia.org/page/Nero and it has the rdf:type, rdfs:label and dbo:birthDate properties referenced in my query. When I run the above query, it returns 100 other people whose names begin with "Ner", using the case-insensitive version of REGEX.
So my question is: Why does the above query not return Nero in the list of results?
First thing, note that on DBpedia, Nero (currently) comes in result rows 109 and 110, so you need at least 110 rows to see him.
Now, note that Nero's description on DBpedia-Live is different than on DBpedia -- and that neither is correct.
Of particular note, see that on DBpedia-Live, Nero's http://dbpedia.org/property/birthDate has one value -- 1937-12-15 (xsd:date) -- while on DBpedia, http://dbpedia.org/ontology/birthDate has two values -- 0037-12-15 (xsd:date) and 1937-12-15 (xsd:date).
Also note that on DBpedia-Live, Nero has no http://dbpedia.org/ontology/birthDate value -- while on DBpedia, http://dbpedia.org/property/birthDate has no value.
These results should be of interest to you -- on DBpedia, and on DBpedia-Live.

SPARQL query returns multiple birth dates for same person

I am learning SPARQL and dbpedia by working through the queries in https://www.joe0.com/2014/09/22/how-to-use-sparql-to-query-dbpedia-and-freebase/ . I am testing a query to return John Lennon's date of birth and I am running my queries in http://dbpedia.org/sparql . The query is:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?x0 ?x1 WHERE {
?x0 rdf:type foaf:Person.
?x0 rdfs:label "John Lennon"@en.
?x0 dbpedia-owl:birthDate ?x1.
}
It returns two rows containing the same date (9 Oct 1940). My question is: why does the query return two rows even though it uses DISTINCT? Prior to asking this question I checked the following:
Why does my SPARQL query duplicate results?
Duplicate rows when making SPARQL queries
but I don't think they explain the duplicate dates.
Edit: I converted the results to text and pasted them below
---------------------------------------  -----------------------------------------------------
x0                                       x1
---------------------------------------  -----------------------------------------------------
http://dbpedia.org/resource/John_Lennon  1940-10-09
http://dbpedia.org/resource/John_Lennon  "1940-10-9"^^<http://www.w3.org/2001/XMLSchema#date>
As stated, it seems DBpedia actually has two dates, 1940-10-09 (valid) and 1940-10-9 (invalid). The answer is to add a FILTER that converts the date to a string and only allows dates conforming to YYYY-MM-DD. Anyway it works!
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?x0 ?x1 (STR(?x1) AS ?x1_str) WHERE {
?x0 rdf:type foaf:Person.
?x0 rdfs:label "John Lennon"@en.
?x0 dbpedia-owl:birthDate ?x1.
FILTER (REGEX(STR(?x1),"[0-9]{4}-[0-9]{2}-[0-9]{2}")).
}
Well, it is not your fault! Simply the resource has both of these triples as you can see here. There are duplicates in the data.
I ran your query on the DBpedia endpoint and asked for the results in an RDF-based format (Turtle), and found that the lexical forms of the date literals are actually different:
"1940-10-09"^^xsd:date
"1940-10-9"^^xsd:date
The second isn't actually a legal xsd:date. The first is, which is probably why the SPARQL endpoint prints it in "pretty" fashion in the HTML table (as just 1940-10-09).
The result is a slowdown on queries, because either each access to an invalid date triggers an exception (for example, with a query from Fuseki), or the filter does the job of eliminating the wrong date, which is costly.
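The difference between the two lexical forms can be checked mechanically. The Python sketch below uses the same YYYY-MM-DD pattern as the FILTER above, applied to the whole string. It deliberately ignores the optional timezone and the other forms that the full xsd:date definition allows, and the function name is hypothetical.

```python
import re

# Simplified xsd:date shape: four-digit year, two-digit month, two-digit day.
# Timezones, negative years and range checks are intentionally left out.
DATE_SHAPE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def looks_like_xsd_date(lexical: str) -> bool:
    """Return True if the whole string has the YYYY-MM-DD shape."""
    return DATE_SHAPE.fullmatch(lexical) is not None


candidates = ["1940-10-09", "1940-10-9"]
kept = [c for c in candidates if looks_like_xsd_date(c)]
print(kept)
```

A check like this could also be run over the data before loading it, so that malformed dates are caught once rather than filtered on every query.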

Find people linked with Person X using Sparql and DBpedia

I am new to querying DBpedia using SPARQL. I would like to find people related to a person X using DBpedia.
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpr: <http://dbpedia.org/resource/>
PREFIX dbpo: <http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?P
WHERE {dbpr:Chuck_Norris}
In case you want the people directly related to a person, then you need two triple patterns, one having the person as a subject and another as an object. Here is one way to construct such a query:
SELECT DISTINCT ?has ?of ?person
WHERE {
{?person a foaf:Person;
?of dbr:Chuck_Norris}
UNION
{?person a foaf:Person.
dbr:Chuck_Norris ?has ?person}
}
I'm using dbr: as this is a predefined prefix in DBpedia.
In some cases, you'll get different results if you query on http://live.dbpedia.org/sparql
Now, this query has a limitation. It will get you only relations to other DBpedia resources. Quite often there might be related persons asserted as literals. I don't know of an elegant way to get them, but one way would be to ask for a list of properties that are known to be used for person relations. Here is an example with a list with one value for dbp:spouse:
SELECT DISTINCT ?has ?of ?person
WHERE {
{?person a foaf:Person;
?of dbr:Chuck_Norris}
UNION
{?person a foaf:Person.
dbr:Chuck_Norris ?has ?person}
UNION
{values ?has {dbp:spouse}
dbr:Chuck_Norris ?has ?person .}
}
From http://live.dbpedia.org/sparql this additionally returns:
has | person
---------------------------------------------------------
http://dbpedia.org/property/spouse | "Dianne Holechek"@en
http://dbpedia.org/property/spouse | "Gena O'Kelley"@en
---------------------------------------------------------
In general asking for literals like that will bring you a lot of garbage results, which you can reduce using string filters.

Openlink Virtuoso SPARQL OFFSET and LIMIT behavior

The following SPARQL query returns 20 results. I was expecting 10, given the OFFSET and LIMIT.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbpedia:<http://dbpedia.org/resource/>
PREFIX dbpedia-owl:<http://dbpedia.org/ontology/>
PREFIX dbpprop: <http://dbpedia.org/property/>
SELECT ?person_id ?person2_id
WHERE {
{
SELECT DISTINCT ?person_id ?person2_id WHERE {
?person rdf:type dbpedia-owl:Person .
?person2 rdf:type dbpedia-owl:Person .
?person ?link ?person2 .
?person dbpedia-owl:wikiPageID ?person_id .
?person2 dbpedia-owl:wikiPageID ?person2_id .
FILTER (?link = dbpedia-owl:wikiPageWikiLink) .
} ORDER BY ?link
}
} OFFSET 10 LIMIT 10
I execute the code in the SPARQL endpoint of an OpenLink Virtuoso Server.
What is the problem with the query?
The clause that makes the query behave weirdly is ORDER BY ?link. After replacing it with ORDER BY ?person_id, everything works as expected. It still makes no sense to me, but I am new to SPARQL too.
@jordipala said:
The clause that makes the query behave weirdly is ORDER BY ?link. After replacing it with ORDER BY ?person_id, everything works as expected. It still makes no sense to me, but I am new to SPARQL too.
Part of the issue is that the ?link values aren't string literals, though they may appear to be if you include that variable in the SELECT clauses. (Also note that the ?link values are all the same for these solutions, so you definitely need to put something else into the ORDER BY such that it does the desired job of preventing both duplicate solutions and omitted solutions.)
Also, counterintuitively given the ?link datatype, the numeric-appearing person_id and person2_id are not typed as numbers -- they're strings, unless you force their datatype.
If you simply change ?link to str(?link) in the ORDER BY, the query will deliver the desired 10 rows -- and you may note that all the ?link values are identical! -- and if you include person_id and person2_id in the ORDER BY (done in my following links by using the ordinals of the SELECT variables, because of where and how I've coerced the datatypes), you'll get more useful output, as in this query and these results
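The point about ORDER BY can be illustrated with a small Python sketch. It assumes, for illustration only, that an engine may hand back tied rows in any order; the sample rows and the page helper are made up, and this is not a model of Virtuoso's actual execution. When every row has the same sort key, the page contents depend on the arrival order. When the sort key distinguishes all rows, the same page comes back every time.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int]  # (person_id, person2_id), hypothetical sample data


def page(rows: List[Row], key: Callable[[Row], object],
         offset: int, limit: int) -> List[Row]:
    """Sort rows by key, then apply OFFSET and LIMIT as a slice."""
    return sorted(rows, key=key)[offset:offset + limit]


# The same solutions arriving in two different orders.
rows_a = [(3, 7), (1, 2), (2, 5), (1, 9)]
rows_b = list(reversed(rows_a))


def constant_key(row: Row) -> int:
    """Every row ties, as with ORDER BY ?link when ?link is always the same."""
    return 0


def total_key(row: Row) -> Row:
    """Every row is distinguishable, as with ORDER BY ?person_id ?person2_id."""
    return row


# With a constant key the sort leaves the arrival order untouched,
# so the two runs produce different pages.
tied_a = page(rows_a, constant_key, 1, 2)
tied_b = page(rows_b, constant_key, 1, 2)

# With a total ordering both runs produce the same page.
ordered_a = page(rows_a, total_key, 1, 2)
ordered_b = page(rows_b, total_key, 1, 2)

print(tied_a, tied_b, ordered_a, ordered_b)
```

This is why paging with OFFSET and LIMIT is only dependable when the ORDER BY keys distinguish every solution.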