Wikidata: an effective way to count items that share two properties - sparql

I would like to count the number of Wikidata items that have two properties at the same time, for example a VIAF ID and a BnF ID, or a LoC ID and a SUDOC ID. The first approach that comes to mind is a query like this:
SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {
?item wdt:P214 ?viaf.
?item wdt:P268 ?bnf.
}
But this query is slow (23 seconds), and applying it to 10 properties would require 90 pairwise comparisons. Is there a more efficient way to perform these calculations?
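As a rough illustration of the pairwise bookkeeping only (it does not make any single query faster), the sketch below generates one count query per unordered pair of properties. Since "has A and B" is the same set as "has B and A", 10 properties need 45 queries rather than 90. The helper name `pairwise_count_queries` is hypothetical, and P244 (assumed here to be the LoC identifier) is not taken from the question.

```python
from itertools import combinations

# Same shape as the query in the question; doubled braces are literal braces.
QUERY_TEMPLATE = """SELECT (COUNT(DISTINCT ?item) AS ?count) WHERE {{
  ?item wdt:{first} ?a .
  ?item wdt:{second} ?b .
}}"""


def pairwise_count_queries(properties):
    """Return one count query per unordered pair of property IDs."""
    return {
        (first, second): QUERY_TEMPLATE.format(first=first, second=second)
        for first, second in combinations(properties, 2)
    }


# P214 and P268 come from the question; P244 is an assumed third property.
queries = pairwise_count_queries(["P214", "P268", "P244"])
for pair, query in queries.items():
    print(pair)
    print(query)
```

Each generated query would still have to be sent to the endpoint separately; that part is left out here.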

Related

What order does Wikidata's SPARQL endpoint use when there is no `ORDER BY` clause?

The following query loads twenty cities from Wikidata, together with the country or state of which they are the capital:
SELECT ?item ?capitalOf WHERE {
?item wdt:P31/wdt:P279* wd:Q515.
OPTIONAL { ?item wdt:P1376 ?capitalOf. }
}
LIMIT 20 OFFSET 0
The results are Q60 (New York), Q62 (San Francisco), Q64 (Berlin), Q84 (London) etc.
Now set the OFFSET parameter to 1, and you get the same list starting at Q62. Index 0 is omitted, as expected. With OFFSET set to 2, you get the same list starting at index 2.
However, sometimes you get a completely different result. For example, I just got the list Q2807, Q2861, Q2900 ... when using OFFSET 70, but this list didn't overlap the list from OFFSET 60. It seems that there is some randomness in the LIMIT and OFFSET query clauses.
What is the default sort order of SPARQL results?
The reason I'm asking: We are using some queries that need to be loaded with LIMIT and ORDER BY, because the number of results is so big. Moreover, these queries run into timeouts when using an ORDER BY.
Just as with SQL, there is no default sort order of SPARQL results.
The SPARQL processor is allowed to return solutions in any order, which may (but typically will not) vary with each execution of the query.
The only way to be certain of the order of solutions is to include an ORDER BY clause.
When running expensive (e.g., long-running) queries, the best way to address this is to set up your own local repository/processor/endpoint, instead of using a public endpoint, such as that at wikidata.org.

get a variable number of columns for output in sparql

Is there a way to get a variable number of columns for a given predicate? Essentially, I want to turn this:
title note
A. 1
A. 2
A. 3
B. 4
B. 5
into
title note1 note2 note3
A. 1 2 3
B. 4 5 null
For instance, can I set the number of columns to the maximum number of "notes" in the query, or something like that? Thanks.
There are several ways you can approach this. One way is to change your query. Now, in the general case it is not possible to do a SELECT query that does exactly what you want. However, if you happen to know in advance what the maximum number of notes per title is, you can sort of do this.
Supposing your original query was something like this:
SELECT ?title ?note
WHERE { ?title :hasNote ?note }
And supposing you know titles have at most 3 notes, you could probably (untested) do something like this:
SELECT ?title ?note1 ?note2 ?note3
WHERE {
?title :hasNote ?note1 .
OPTIONAL { ?title :hasNote ?note2 . FILTER (?note2 != ?note1) }
OPTIONAL { ?title :hasNote ?note3 . FILTER (?note3 != ?note1 && ?note3 != ?note2) }
}
As you can see, this is not a very nice solution: it doesn't scale, it returns one row for every ordering of a title's notes rather than a single row per title, and it is probably very inefficient to process as well.
Alternatives are various forms of post-processing. To make it simpler to post-process you could use an aggregate operator to get all notes for a single item on a single line at least:
SELECT ?title (GROUP_CONCAT(?note) as ?notes)
WHERE { ?title :hasNote ?note }
GROUP BY ?title
result:
title notes
A. "1 2 3"
B. "4 5"
You could then post-process the values of the ?notes variable to split them into the separate notes again.
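A minimal sketch of that post-processing step, assuming the result rows have already been fetched as `(title, notes)` pairs of plain strings and that notes were joined with GROUP_CONCAT's default separator, a single space. The function name `pivot_notes` and the sample rows are illustrative only.

```python
def pivot_notes(rows, separator=" "):
    """Split each concatenated notes string and pad rows to equal width."""
    split_rows = [(title, notes.split(separator)) for title, notes in rows]
    # The widest row decides how many note columns the table gets.
    width = max((len(notes) for _, notes in split_rows), default=0)
    return [
        [title] + notes + [None] * (width - len(notes))
        for title, notes in split_rows
    ]


rows = [("A.", "1 2 3"), ("B.", "4 5")]
table = pivot_notes(rows)
for row in table:
    print(row)
```

If a note can itself contain a space, splitting on the default separator breaks; in that case a separator that cannot occur in the data could be chosen in the query, e.g. `GROUP_CONCAT(?note; separator="|")`, and passed to the function as `separator="|"`.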
Another solution is that instead of using a SELECT query, you use a CONSTRUCT query to give you back an RDF graph, rather than a table, and work directly with that in your code. Tables are kinda weird in an RDF world if you think about it: you're querying a graph model, why is the query result not a graph but a table?
CONSTRUCT
WHERE { ?title :hasNote ?note }
...and then process the result in whatever API you're using to do the queries.

SPARQL - Find objects with the most similar properties

Let's say there is an RDF DB of people, and each of these people has many triples defining that person's friends (so many triples of the form 'person' x:hasFriend 'otherPerson'). How can I find people who have the most similar friends? I'm a novice at SPARQL and this seems like a really complex query.
Basically, the results would be a list of people starting from the ones with the most similar friends list (to the person specified in the query) and then going down the list to the people with the least similar friends list.
So let's say I run this query for person1; the results would be something like:
person2 - 300 of the same friends
person30 - 245 of the same friends
person18 - 16 of the same friends
etc.
If you adapt the approach in my answer to How to find similar content using SPARQL (of which this might be considered a duplicate), you'd end up with something like:
select ?otherPerson (count(?friend) as ?numFriends) where {
:person1 :hasFriend ?friend . #-- person1 has a friend, ?friend .
?otherPerson :hasFriend ?friend . #-- so does ?otherPerson .
}
group by ?otherPerson #-- One result row per ?otherPerson,
order by desc(?numFriends) #-- ordered by number of common friends.
If you really want to, you can use a reverse property path to make the query pattern a little shorter:
select ?otherPerson (count(?friend) as ?numFriends) where {
#-- some ?friend is a friend of both :person1 and ?otherPerson .
?friend ^:hasFriend :person1, ?otherPerson .
}
group by ?otherPerson #-- One result row per ?otherPerson,
order by desc(?numFriends) #-- ordered by number of common friends.
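To make the grouping logic concrete, here is a small in-memory sketch of what these queries compute, assuming the `hasFriend` triples are available as `(person, friend)` pairs. The names and data are made up. One deliberate difference: the queries as written also return :person1 itself (it shares all of its own friends), whereas this sketch leaves it out, which would correspond to adding `FILTER(?otherPerson != :person1)`.

```python
from collections import Counter


def rank_by_common_friends(pairs, person):
    """Rank other people by how many friends they share with `person`."""
    pairs = set(pairs)  # an RDF graph is a set of triples, so drop duplicates
    friends = {friend for subject, friend in pairs if subject == person}
    counts = Counter(
        subject
        for subject, friend in pairs
        if subject != person and friend in friends
    )
    # Highest count first; ties are broken by name to keep output stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


pairs = [
    ("person1", "a"), ("person1", "b"), ("person1", "c"),
    ("person2", "a"), ("person2", "b"),
    ("person3", "c"), ("person3", "d"),
    ("person4", "e"),
]
ranking = rank_by_common_friends(pairs, "person1")
print(ranking)
```

As in the SPARQL version, people with no friends in common (here `person4`) do not appear in the result at all.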

How to force a Virtuoso SPARQL endpoint to return the full answer?

I want to query DBpedia using Virtuoso. For queries with very many results, it returns only part of them. For example, in the query below, the predicate http://dbpedia.org/ontology/birthplace is missing. Is there any way to get all the results, either from Virtuoso or from any other endpoint?
SELECT DISTINCT ( ?p AS ?outEdge )
( ?q AS ?inEdge )
( ?px AS ?dest )
( ?qx AS ?source )
WHERE {
{ <http://dbpedia.org/resource/England> ?p ?px . }
UNION
{ ?qx ?q <http://dbpedia.org/resource/England> . }
}
While I don't detect anything malicious or mean-spirited in your question, you're essentially asking how to circumvent DBpedia's defenses against intentional and unintentional denial-of-service attacks. Internal limits help to ensure that no single query consumes too many resources. The right way to get all the results from a SPARQL query, if they aren't all returned at once, is to use limit, offset, and order by, and to issue multiple queries. E.g.,
#-- get first 10 results
select ... where ...
order by ?name
limit 10 offset 0
#-- get next 10 results
select ... where ...
order by ?name
limit 10 offset 10
#-- get more results
select ... where ...
order by ?name
limit 10 offset 20
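The same pattern can be generated programmatically. The sketch below only builds the query strings for successive pages; actually sending them to an endpoint is left out, and the function name `paged_queries` and the example pattern are placeholders rather than anything from the question.

```python
def paged_queries(select_where, order_var, page_size, pages):
    """Yield one query string per page, each with ORDER BY, LIMIT and OFFSET."""
    for page in range(pages):
        offset = page * page_size
        yield (
            f"{select_where}\n"
            f"ORDER BY {order_var}\n"
            f"LIMIT {page_size} OFFSET {offset}"
        )


# Placeholder pattern; substitute the real SELECT ... WHERE ... part.
queries = list(paged_queries("SELECT ?name WHERE { ?s ?p ?name }", "?name", 10, 3))
for query in queries:
    print(query)
    print()
```

In practice the number of pages is usually not known in advance, so a client would keep requesting pages until one comes back with fewer than `page_size` rows.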

How to form SPARQL queries that refer to multiple resources

My question is a followup with my first question about SPARQL here.
My SPARQL query results for Mountain objects are here.
From those results I picked a certain object resource.
Now I want to get values of "is dbpedia-owl:highestPlace of" records for this chosen Mountain object.
That is, names of mountain ranges for which this mountain is highest place of.
This is, as I figure, complex. Not only because I do not know the required syntax, but also I get two objects here.
One of them is Mont Blanc massif, which is of type "place".
Another one is Western Alps which is of type "mountain range" - my desired record.
I need record # 2 above but not 1. I know 1 is also relevant but sometimes it doesn't follow same pattern. Sometimes the records appear to be of YAGO type, which can be totally misleading. To be safe, I simply want to discard those records whenever there is type mismatch.
How can I form my SPARQL query to get these "is dbpedia-owl:highestPlace of" records and also have the type filtering?
You can use this query. Note, however, that Mont_Blanc_massif in your example is both a dbpedia-owl:Place and a dbpedia-owl:MountainRange.
select * where {
?place dbpedia-owl:highestPlace :Mont_Blanc.
?place rdf:type dbpedia-owl:MountainRange.
}
edit after comment: filter
It is not really clear what you want to filter out (YAGO types?). Technically, you can filter like this, for example:
select * where {
?place dbpedia-owl:highestPlace :Mont_Blanc.
?place rdf:type dbpedia-owl:MountainRange.
FILTER NOT EXISTS {
?place ?pred ?obj
FILTER (regex(str(?obj), "yago"))
}
}
This filters out results that have any object with 'yago' in its URL.
Extending the result from the previous answer, the appropriate query would be
select * where {
?mountain a dbpedia-owl:Mountain ;
dbpedia-owl:abstract ?abstract ;
foaf:depiction ?depiction .
?range a dbpedia-owl:MountainRange ;
dbpedia-owl:highestPlace ?mountain .
FILTER(langMatches(lang(?abstract),"EN"))
}
LIMIT 10
This selects mountains with English abstracts that have at least one depiction (or else the pattern wouldn't match) and for which there is some mountain range of which the mountain is the highest place. Without the parts from the earlier question, if you just want to retrieve mountains that are the highest place of a range, you can use a query like this:
select * where {
?mountain a dbpedia-owl:Mountain .
?range a dbpedia-owl:MountainRange ;
dbpedia-owl:highestPlace ?mountain .
}
LIMIT 10