Find the most precise common superclasses between two items - sparql

I would like to find the first common superclass(es) between several Wikidata entities.
Let's take a bridge and a cemetery. What is their "smallest" common superclass?
A bridge is a subclass of "architectural structure".
A cemetery is a subclass of "place of worship", which is a subclass of "architectural structure".
---> Their most specialized common class is "architectural structure".
This SPARQL query is close to the solution:
SELECT ?classe ?classeLabel WHERE {
wd:Q12280 wdt:P279* ?classe .
FILTER EXISTS { wd:Q39614 wdt:P279* ?classe .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Problem: it returns all common classes between both items, not just the first ones. How could I filter the answer to get what I want?

If this question is of interest to someone else, here is the SPARQL query I finally used to get the least common subsumers of more than two items. It is a mix of @AKSW's response in the comments and the answer to that previous question on SO.
SELECT ?lcs ?lcsLabel WHERE {
?lcs ^wdt:P279* wd:Q32815, wd:Q34627, wd:Q16970, wd:Q16560 .
filter not exists {
?sublcs ^wdt:P279* wd:Q32815, wd:Q34627, wd:Q16970, wd:Q16560 ;
wdt:P279 ?lcs .
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Try it.
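The logic of that query can be sketched outside SPARQL as well. The following is a minimal Python illustration on a hypothetical toy hierarchy (the `SUBCLASS_OF` dictionary and both function names are made up for this sketch, and the "structure" root is an invented extra level). It keeps the classes common to all inputs and then drops any common class that has a direct subclass which is also common, which is what the FILTER NOT EXISTS part does.

```python
from typing import Dict, Set

# Hypothetical toy hierarchy: class -> direct superclasses (stand-in for wdt:P279).
SUBCLASS_OF: Dict[str, Set[str]] = {
    "bridge": {"architectural structure"},
    "cemetery": {"place of worship"},
    "place of worship": {"architectural structure"},
    "architectural structure": {"structure"},
    "structure": set(),
}


def ancestors_or_self(cls: str) -> Set[str]:
    """Counterpart of `cls wdt:P279* ?x`: the class itself plus all its superclasses."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def least_common_subsumers(*classes: str) -> Set[str]:
    """Common superclasses that have no direct subclass which is also common."""
    common = set.intersection(*(ancestors_or_self(c) for c in classes))
    return {
        c for c in common
        if not any(c in SUBCLASS_OF.get(sub, ()) for sub in common)
    }
```

On this toy data, `least_common_subsumers("bridge", "cemetery")` keeps only "architectural structure", because the broader "structure" has a common direct subclass.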

Related

How to get *all* superclasses of a Wikidata entity with SPARQL?

I am interested in visualizing the Wikidata class hierarchy to create graphs like
I know how I can get direct superclasses of a Wikidata entity. For this I use SPARQL code like:
SELECT ?item ?itemLabel
WHERE
{
wd:Q125977 wdt:P279 ?item.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
where wdt:P279 denotes the subclass of-property.
However, this direct method requires many single requests to the Wikidata API.
How is it possible to get the same information with a single SPARQL query?
(Note that the example graph above only shows an abbreviated version. The final desired graph of all superclasses is 13 levels deep and has 69 nodes which means 68 single requests, see this jupyter notebook if interested.)
You could use a query like this to create your taxonomy (with labels) as triples directly.
CONSTRUCT {
?item1 wdt:P279 ?item2.
?item1 rdfs:label ?item1Label.
?item2 rdfs:label ?item2Label.
}
WHERE {
SELECT ?item1 ?item2 ?item1Label ?item2Label
WHERE {
wd:Q125977 (wdt:P279*) ?item1, ?item2.
FILTER(EXISTS { ?item1 wdt:P279 ?item2. })
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
}
I think you need a query like the following:
SELECT ?class ?classLabel ?superclass ?superclassLabel
WHERE
{
wd:Q125977 wdt:P279* ?class.
?class wdt:P279 ?superclass.
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
where wdt:P279* is a zero-or-more property path connecting a class to itself or to any of its direct or indirect superclasses.
This will generate a "class -> superclass" mapping containing everything you need to build the graph that you illustrated.
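Once the "class -> superclass" rows are available, turning them into a graph structure takes only a few lines. This is a small Python sketch that assumes the rows have already been fetched and reduced to (class, superclass) pairs; the function name and the sample rows are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def build_hierarchy(rows: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Turn (class, superclass) rows into an adjacency mapping class -> superclasses."""
    graph: Dict[str, Set[str]] = {}
    for cls, superclass in rows:
        graph.setdefault(cls, set()).add(superclass)
        # Make sure classes that only appear as superclasses are nodes too.
        graph.setdefault(superclass, set())
    return graph


# Made-up rows standing in for the query result.
rows = [("A", "B"), ("B", "C"), ("A", "C")]
graph = build_hierarchy(rows)
node_count = len(graph)
edge_count = sum(len(parents) for parents in graph.values())
```

The resulting mapping can be handed to whatever graph drawing tool is used for the visualization.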

How to retrieve the categorical details in Wikidata

I have a list of instances as follows.
myinstances = ["word2vec", "tf-idf", "dijkstra's algorithm"]
For each myinstance in the above list, I want to find:
1. What are the other instances of `myinstance`'s category (i.e. only one hop)
2. What are the instances of `myinstance`'s category's category (i.e. two hops)
For example, if we consider myinstance = word2vec
What are the other instances of myinstance's category (i.e. only one hop)?
As shown in the figure below, the other instance of its immediate ancestor is GloVe.
What are the instances of myinstance's category's category (i.e. two hops)? In other words, what are the instances of embedding (which is two hops away from word2vec), as shown in the figure below?
Can such queries be performed in SPARQL?
My current code is as follows.
SELECT * {
VALUES ?searchTerm { "word2vec" "tf-idf" "dijkstra's algorithm" }
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:api "EntitySearch".
bd:serviceParam wikibase:endpoint "www.wikidata.org".
bd:serviceParam wikibase:limit 1 .
bd:serviceParam mwapi:search ?searchTerm.
bd:serviceParam mwapi:language "en".
?item wikibase:apiOutputItem mwapi:item.
?num wikibase:apiOrdinal true.
}
?item (wdt:P279|wdt:P31) ?type
}
I am happy to provide more details if needed.

want to remove entity that has no label from the result

I'd like to ask one tricky thing about labels. Using the label service, as in SERVICE wikibase:label { bd:serviceParam wikibase:language "ko,en". }, lets us fall back to another language when the target entity has no label in the first preferred language.
However, I want to drop entities that do not have any label. The service instead returns the Qxxxx ID as the label when the entity has no label in any of the listed languages. How can I remove such entities from the result?
I know we can filter them out by using rdfs:label explicitly for all the variables, but adding rdfs:label patterns for every variable is another headache. So I'd like to know how to improve the query while keeping SERVICE wikibase:label, so that entities without any label are filtered out. Should I replace SERVICE with rdfs:label?
SELECT DISTINCT ?vLabel
WHERE {
hint:Query hint:optimizer "None" .
{
SELECT DISTINCT ?i {
?i wdt:P31 wd:Q515.
} LIMIT 15
}
?v wdt:P937 ?i.
SERVICE wikibase:label { bd:serviceParam wikibase:language "ko,en". }
}
LIMIT 3
RESULT:
Q59780594 <- no lang label
Q24642253 <- no lang label
The Wikidata label service doesn't provide a built-in way to skip resources that don't have a label.
The simplest option would be to wrap the query as a subquery into a new SELECT query, and use a filter to remove any Qxxxx labels. This uses the fact that only the real labels have a language tag:
SELECT ?vLabel {
{
SELECT DISTINCT ?vLabel
...
}
FILTER lang(?vLabel)
}
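The same idea can also be applied on the client side after the query has run. The sketch below is Python and assumes the results come back in the standard SPARQL JSON results format, where a language-tagged literal carries an "xml:lang" key; following the observation above, the Qxxxx fallback labels are assumed to lack that key. The function name and the sample bindings are hypothetical (the "Example label" row is made up).

```python
from typing import Any, Dict, List

Binding = Dict[str, Dict[str, Any]]


def keep_labelled(bindings: List[Binding], var: str) -> List[Binding]:
    """Keep only rows whose value for `var` is a literal with a language tag."""
    return [row for row in bindings if row.get(var, {}).get("xml:lang")]


# Made-up sample in the shape of results["results"]["bindings"].
sample: List[Binding] = [
    {"vLabel": {"type": "literal", "value": "Q59780594"}},
    {"vLabel": {"type": "literal", "xml:lang": "en", "value": "Example label"}},
]
labelled = keep_labelled(sample, "vLabel")
```

This avoids touching the query at all, at the cost of transferring the unlabelled rows first.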
Edit: Below is my original (and inferior) answer, which used a regular expression on the label itself to remove the Qxxxx ones. It would also filter out any resources that actually have a label of the form Qxxxx, if such resources exist in Wikidata.
SELECT ?vLabel {
{
SELECT DISTINCT ?vLabel
...
}
FILTER (!REGEX(?vLabel, "^Q[0-9]+$"))
}

Getting all the children of a Wikidata item (but not instances)

For example, take these three cases:
Triangle Shirtwaist Factory fire (Q867316) : instance of (P31): disaster (Q3839081)
disaster: subclass of (P279) : occurrence (Q1190554)
occurrence (Q1190554) : subclass of: temporal entity (Q26907166)
World's Fair (Q172754) : subclass of (P279) : exhibition (Q464980)
exhibition (Q464980) : subclass of (P279) : event (Q1656682)
event (Q1656682) : subclass of (P279) : occurrence (Q1190554)
occurrence (Q1190554) : subclass of: temporal entity (Q26907166)
Peloponnesian War (Q33745) : instance of (P31): war (Q198)
war (Q198) : subclass of (P279) : occurrence (Q1190554)
occurrence (Q1190554) : subclass of: temporal entity (Q26907166)
I would like all the descendants of temporal entity stopping before the instances (Triangle Shirtwaist Factory fire, World's Fair, Peloponnesian War).
Is there a way to do this with SPARQL or the API?
If I understand you correctly, you just want to get the way instances are classified on Wikidata.
So, starting with the example @AKSW gave:
SELECT DISTINCT ?event_type ?event_typeLabel {
?event_type wdt:P279* wd:Q26907166
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
The * path operator is expensive to evaluate, and at the time of writing Wikidata has close to 50 million items. That is why I had to add the LIMIT: without it I was getting timeouts.
Graphing it
To get a feel for the data I like to look at it in the Wikidata graph builder, because it shows the clustering so nicely.
https://angryloki.github.io/wikidata-graph-builder/?property=P279&item=Q26907166&iterations=2&mode=reverse
As you can see there are already a lot of classifications after 2 iterations. So we might also already be happy with this query:
SELECT DISTINCT ?event_type ?event_typeLabel {
?event_type wdt:P279/wdt:P279 wd:Q26907166
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Note that it only goes 2 times along the property P279. At the moment this gives me 281 items.
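To make the difference between a fixed number of hops and the full * traversal concrete, here is a small Python sketch on a toy hierarchy built from the examples in the question. The function name and the dictionary layout are hypothetical; the function returns the classes that are exactly `depth` subclass steps below the root, which corresponds to chaining wdt:P279 that many times.

```python
from typing import Dict, Set

# Toy hierarchy taken from the question's examples: class -> direct superclasses.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "occurrence": {"temporal entity"},
    "disaster": {"occurrence"},
    "war": {"occurrence"},
    "event": {"occurrence"},
    "exhibition": {"event"},
    "World's Fair": {"exhibition"},
}


def subclasses_at_depth(root: str, subclass_of: Dict[str, Set[str]], depth: int) -> Set[str]:
    """Classes reached by following 'subclass of' backwards exactly `depth` times."""
    frontier = {root}
    for _ in range(depth):
        frontier = {
            child for child, parents in subclass_of.items() if parents & frontier
        }
    return frontier


two_hops = subclasses_at_depth("temporal entity", SUBCLASS_OF, 2)
```

Here `two_hops` contains "disaster", "war" and "event", while deeper classes such as "exhibition" would only appear with a larger depth.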
If you really need to traverse the tree in full, you can exclude items that have "instance of" (P31) statements using FILTER NOT EXISTS. The problem is that this query currently always runs into timeouts:
SELECT DISTINCT ?event_type ?event_typeLabel {
?event_type wdt:P279* wd:Q26907166 .
FILTER NOT EXISTS { ?event_type wdt:P31 [] }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
With a subquery you can limit the results from the tree, but you will get incomplete data:
SELECT ?event_type ?event_typeLabel
WHERE
{
{
SELECT DISTINCT ?event_type
WHERE
{
?event_type wdt:P279* wd:Q26907166 .
}
LIMIT 1000
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
FILTER NOT EXISTS { ?event_type wdt:P31 [] }
}

Filter by type in Wikidata

This SPARQL request looks for all cities called "Berlin" in Wikidata:
SELECT DISTINCT ?item ?itemLabel ?itemDescription WHERE {
?type (a | wdt:P279) wd:Q515. # Sub-type of city
?item wdt:P31 ?type.
?item rdfs:label "Berlin"@en.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
PROBLEM: It returns zero results.
Meanwhile, the request below correctly finds Q64 (capital and city-state of Germany), but it also returns a lot of other things called Berlin, so I want to filter on cities (then in a future phase I will order these cities by population, but that is outside the scope of this question):
SELECT DISTINCT ?item ?itemLabel ?itemDescription WHERE {
?item rdfs:label "Berlin"@en.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Note: My code for getting instances of subclasses of city (Berlin is a big city which is subclass of city) seems to work correctly, as illustrated by the results of this query.
It was a Wikidata bug.
According to Wikidata's Jura1, it was a bug in Wikidata caused by someone's experiments with "preferred rank".
Discussion at https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2016/09#P31_inconsistency
The bug has been fixed just now.
You can only query for data that is contained in the dataset.
If you try an alternative of your query
SELECT DISTINCT ?item ?itemLabel ?itemDescription ?type1 ?type2 WHERE {
?item rdfs:label "Berlin"@en.
optional{?item rdf:type ?type1 }
optional{?item wdt:P279 ?type2 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
it returns no types, neither connected by rdf:type nor wdt:P279.
If you look at the entity for Berlin, the capital and city-state, you can see that it has "instance of" statements, but that property is https://www.wikidata.org/wiki/Property:P31, not rdf:type. None of those statements links to wd:Q515, so I wonder where you got this idea.
To be honest, I don't know that much about Wikidata, and it's not clear to me why rdf:type is not used. A common pattern for RDF datasets, though, is to use
?s rdf:type/rdfs:subClassOf* SUPER_CLASS .
if we assume that there is rdf:type information available.
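What that path pattern checks can be sketched in Python: take the direct types of an item, then walk up the subclass chain zero or more times and see whether the target class is reached. The dictionaries below are a hypothetical toy dataset based on the question's remark that Berlin is a big city and that big city is a subclass of city; the function name is made up for this sketch.

```python
from typing import Dict, Set


def is_instance_of(
    item: str,
    target: str,
    instance_of: Dict[str, Set[str]],
    subclass_of: Dict[str, Set[str]],
) -> bool:
    """True if item has a direct type that equals target or is a subclass of it."""
    stack = list(instance_of.get(item, ()))
    seen: Set[str] = set()
    while stack:
        cls = stack.pop()
        if cls == target:
            return True
        if cls not in seen:
            seen.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return False


# Toy data, not actual Wikidata statements.
INSTANCE_OF = {"Berlin": {"big city"}}
SUBCLASS_OF = {"big city": {"city"}}
berlin_is_city = is_instance_of("Berlin", "city", INSTANCE_OF, SUBCLASS_OF)
```

If the type statements are missing from the data, as in the case discussed here, the check fails no matter how the path is written.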
If you check the types wd:Q64 is an instance of
SELECT DISTINCT ?type ?typeLabel WHERE {
wd:Q64 (a | wdt:P31) ?type.
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?typeLabel
None of them are City (wd:Q515) or a sub-class of it.
Looks like a data issue. Perhaps you should contact Wikidata.