Removing entities that have no label from the result - sparql

I'd like to ask about a tricky point with labels. Using the SERVICE keyword, as in SERVICE wikibase:label { bd:serviceParam wikibase:language "ko,en". }, lets us fall back to another language when the target entity has no label in the first preferred language.
However, I want to drop entities that do not have a label in any of the listed languages. In that case the label service still returns the entity, with its Qxxxx identifier as the label. How can I remove such entities from the result?
I know I can filter them out by matching rdfs:label explicitly for every variable, but doing that for all the variables is another headache. So I'd like to know how to improve the query while keeping SERVICE wikibase:label, so that entities without any label are filtered out. Or should I replace SERVICE with rdfs:label?
SELECT DISTINCT ?vLabel
WHERE {
  hint:Query hint:optimizer "None" .
  {
    SELECT DISTINCT ?i {
      ?i wdt:P31 wd:Q515.
    } LIMIT 15
  }
  ?v wdt:P937 ?i.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "ko,en". }
}
LIMIT 3
RESULT:
Q59780594 <- no lang label
Q24642253 <- no lang label

The Wikidata label service doesn't provide a built-in way to skip resources that don't have a label.
The simplest option would be to wrap the query as a subquery into a new SELECT query, and use a filter to remove any Qxxxx labels. This uses the fact that only the real labels have a language tag:
SELECT ?vLabel {
{
SELECT DISTINCT ?vLabel
...
}
FILTER lang(?vLabel)
}
Edit: Below is my original (and inferior) answer, which used a regular expression on the label itself to remove the Qxxxx ones. It would also filter out any resources that actually have a label of the form Qxxxx, if such resources exist in Wikidata.
SELECT ?vLabel {
{
SELECT DISTINCT ?vLabel
...
}
FILTER (!REGEX(?vLabel, "^Q[0-9]+$"))
}
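For reference, here is a sketch of the language-tag filter applied to the full query from the question, rather than to the elided subquery above. It assumes the label service behaves as described (fallback Qxxxx labels carry no language tag) and has not been checked against the live endpoint. The LIMIT 3 is moved to the outer query so that the filter runs before the rows are cut off.

```sparql
SELECT ?vLabel {
  {
    SELECT DISTINCT ?vLabel
    WHERE {
      hint:Query hint:optimizer "None" .
      {
        SELECT DISTINCT ?i {
          ?i wdt:P31 wd:Q515 .
        } LIMIT 15
      }
      ?v wdt:P937 ?i .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "ko,en" . }
    }
  }
  # Fallback labels such as "Q59780594" are plain literals without a language tag,
  # so lang() returns "" for them and the row is dropped.
  FILTER (lang(?vLabel) != "")
}
LIMIT 3
```

Writing the condition as lang(?vLabel) != "" is equivalent to the bare FILTER lang(?vLabel), just more explicit about what is being tested.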

Related

Is it possible to formulate an OPTIONAL "subquery" so that it returns at most one record

The following Wikidata query returns a list of airports and their IATA codes.
I am using ?airport rdfs:label ?airportName to also get a label for the airports. Most airports have labels in multiple languages, so I want to select the English name where possible. Some airports have labels only in en-ca and en-gb, but not in en, so I cannot select them with lang(?airportName) = 'en'.
With the current implementation, I get multiple records for some airports:
select
  ?airport
  ?airportName
  (lang(?airportName) as ?lang)
  ?IATAAirPortCode
{
  ?airport wdt:P238 ?IATAAirPortCode
  optional {
    ?airport rdfs:label ?airportName .
    filter(langMatches(lang(?airportName), 'en'))
  }
}
order by ?IATAAirPortCode
I'd like to have only one record per airport. Is it somehow possible to formulate an optional { ... } clause so that it returns at most one record per airport?
For this style of query, where you want a single rdfs:label value per result, you can use Wikidata's wikibase:label SPARQL extension like this:
SELECT
?airport
?airportLabel
(LANG(?airportLabel) AS ?lang)
?IATAAirPortCode
{
?airport wdt:P238 ?IATAAirPortCode
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en"
}
}
ORDER BY ?IATAAirPortCode
The ?airportLabel variable is automatically bound to a single label for each ?airport, taken only from the given preferred language (the language string "en" here can contain multiple comma-separated language codes, in order of preference).
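As a sketch of the multi-language form, the query below lists the two regional variants mentioned in the question after plain English. The particular fallback order "en,en-gb,en-ca" is only an assumption about what the asker would prefer.

```sparql
SELECT
  ?airport
  ?airportLabel
  (LANG(?airportLabel) AS ?lang)
  ?IATAAirPortCode
{
  ?airport wdt:P238 ?IATAAirPortCode .
  SERVICE wikibase:label {
    # Languages are tried from left to right; the first one that has a label is used.
    bd:serviceParam wikibase:language "en,en-gb,en-ca" .
  }
}
ORDER BY ?IATAAirPortCode
```

The ?lang column then shows which of the listed languages supplied each label.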
A more general-purpose solution in portable SPARQL (without Wikidata extensions) is more complicated, and may differ depending on the specifics of the query. In this particular case, where your OPTIONAL adds only one variable, you can do it without the wikibase extension by using GROUP BY and SAMPLE aggregation:
SELECT
  ?airport
  (SAMPLE(?airportLabel) AS ?airportName)
  (LANG(?airportName) AS ?lang)
  ?IATAAirPortCode
{
  ?airport wdt:P238 ?IATAAirPortCode
  OPTIONAL {
    ?airport rdfs:label ?airportLabel
    FILTER(langMatches(lang(?airportLabel), 'en'))
  }
}
GROUP BY ?airport ?IATAAirPortCode
ORDER BY ?IATAAirPortCode

How to retrieve only actual values

I have a SPARQL query with which I try to retrieve all current German municipalities from Wikidata, along with some of their properties.
For example I try to retrieve their postal codes and parent regions:
SELECT DISTINCT ?region ?regionLabel ?postalCode ?parentLabel WHERE {
?region wdt:P31 wd:Q262166. # Municipalities
?region wdt:P17 wd:Q183. # from Germany
MINUS { ?region p:P576 _:anyValue. } # Only regions which exist today
OPTIONAL { ?region wdt:P281 ?postalCode. } # Select postal code
OPTIONAL { ?region wdt:P131 ?parent. } # Select administrative parents
SERVICE wikibase:label { bd:serviceParam wikibase:language "de" . } # Show german labels
}
As you can see, I already found out how to exclude municipalities that no longer exist (because they have a property p:P576 = end date). I know this is a little fuzzy, because the end date could lie in the future (already decided but not yet reached).
More importantly, the postal codes and parents include "historical ones", which I would like to exclude. I know that I could do something like the answer to "https://stackoverflow.com/questions/49066390/how-to-get-only-the-most-recent-value-from-a-wikidata-property", but the solution there is to bind the end date of the properties, which is usually not set for the current value. Besides, I don't know how to build the query with two optional values.
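One possible direction, given only as an untested sketch: go through the full statement nodes (p: / ps:) instead of wdt:, and keep only statements that carry no "end time" qualifier (pq:P582). This assumes historical values on these items are actually marked with P582. The variable names ?postalStatement and ?parentStatement are made up for illustration.

```sparql
SELECT DISTINCT ?region ?regionLabel ?postalCode ?parentLabel WHERE {
  ?region wdt:P31 wd:Q262166 ;   # Municipalities
          wdt:P17 wd:Q183 .      # from Germany
  MINUS { ?region p:P576 [] . }  # Only regions which exist today
  OPTIONAL {
    # Postal code statements that have no end time qualifier
    ?region p:P281 ?postalStatement .
    ?postalStatement ps:P281 ?postalCode .
    FILTER NOT EXISTS { ?postalStatement pq:P582 ?postalEnd . }
  }
  OPTIONAL {
    # Administrative parent statements that have no end time qualifier
    ?region p:P131 ?parentStatement .
    ?parentStatement ps:P131 ?parent .
    FILTER NOT EXISTS { ?parentStatement pq:P582 ?parentEnd . }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "de" . }
}
```

Each OPTIONAL block is independent, so the same pattern can be repeated for further properties. Unlike wdt:, the p: path also returns deprecated statements, so an extra rank filter may be needed.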

How to filter a variable by a property included in the variable in SPARQL?

I want to write a SPARQL query that would return the first name of a person based on the ranking of the name on Wikidata.
For example, let's say I want the second first name of Mozart (Chrysostom).
This is what I have so far (Mozart Wikidata ID is Q254, first name's property is P735, with P1545 giving the ordinal position of the name):
SELECT DISTINCT ?full_name ?full_nameLabel ?first_nameLabel ?rank
WHERE
{
VALUES ?full_name {wd:Q254} .
?full_name p:P735 [pq:P1545 ?rank] ;
p:P735 [ps:P735 ?first_name] ;
FILTER regex(?rank, "2")
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
However, here the filter applies only to the rank variable, not to the first_name variable.
I think that the problem comes from the fact that the rank property is a sub-element of the first_name property. Would you know of a way to filter the first_name variable by the rank variable?
In the query above, the two bracketed patterns each introduce a separate blank node, so the rank and the name can come from different P735 statements. Putting the qualifier and the value inside the same brackets ties them to one statement:
SELECT DISTINCT ?id ?idLabel ?first_nameLabel ?rank
WHERE {
VALUES ?id {wd:Q254} .
?id p:P735 [
pq:P1545 ?rank;
ps:P735 ?first_name
]
FILTER(?rank = "2")
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

Getting all the children of a Wikidata item (but not instances)

For example, take these three cases:
Triangle Shirtwaist Factory fire (Q867316) : instance of (P31): disaster (Q3839081)
disaster: subclass of (P279) : occurrence (Q1190554)
occurrence (Q1190554) : subclass of: temporal entity (Q26907166)
World's Fair (Q172754) : subclass of (P279) : exhibition (Q464980)
exhibition (Q464980) : subclass of (P279) : event (Q1656682)
event (Q1656682) : subclass of (P279) : occurrence (Q1190554)
occurrence (Q1190554) : subclass of: temporal entity (Q26907166)
Peloponnesian War (Q33745) : instance of (P31): war (Q198)
war (Q198) : subclass of (P279) : occurrence (Q1190554)
occurrence (Q1190554) : subclass of: temporal entity (Q26907166)
I would like all the descendants of temporal entity stopping before the instances (Triangle Shirtwaist Factory fire, World's Fair, Peloponnesian War).
Is there a way to do this with SPARQL or the API?
If I understand you correctly, you just want to get the way instances are classified on Wikidata.
So, starting with the example @AKSW gave:
SELECT DISTINCT ?event_type ?event_typeLabel {
?event_type wdt:P279* wd:Q26907166
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
The * is a pretty expensive operation to calculate, and at the time of writing Wikidata has close to 50 million items. That is why I had to add the LIMIT: I was getting time-outs without it.
Graphing it
To get a feel for the data, I like to look at it in the Wikidata graph builder, because it shows clustering so nicely.
https://angryloki.github.io/wikidata-graph-builder/?property=P279&item=Q26907166&iterations=2&mode=reverse
As you can see there are already a lot of classifications after 2 iterations. So we might also already be happy with this query:
SELECT DISTINCT ?event_type ?event_typeLabel {
?event_type wdt:P279/wdt:P279 wd:Q26907166
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Note that it follows the property P279 exactly twice, so it returns only the classes two levels below temporal entity. At the moment this gives me 281 items.
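If the direct subclasses should be included as well, a small variation (a sketch, not run against the endpoint) makes the second step optional with the ? path modifier:

```sparql
SELECT DISTINCT ?event_type ?event_typeLabel {
  # One mandatory P279 step followed by an optional second one:
  # matches classes one or two levels below temporal entity (Q26907166).
  ?event_type wdt:P279/wdt:P279? wd:Q26907166 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
```

Further optional steps can be chained in the same way to control the depth without resorting to the unbounded *.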
If you really need to traverse the tree in full, you can filter out items that have "instance of" (P31) statements using FILTER NOT EXISTS. The problem is that this currently always runs into timeouts:
SELECT DISTINCT ?event_type ?event_typeLabel {
?event_type wdt:P279* wd:Q26907166 .
FILTER NOT EXISTS { ?event_type wdt:P31 [] }
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 100
With a subquery you can limit the results from the tree, but you will get incomplete data:
SELECT ?event_type ?event_typeLabel
WHERE
{
  {
    SELECT DISTINCT ?event_type
    WHERE
    {
      ?event_type wdt:P279* wd:Q26907166 .
    }
    LIMIT 1000
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  FILTER NOT EXISTS { ?event_type wdt:P31 [] }
}

Find the most precise common superclasses between two items

I would like to find the first common superclass(es) between several Wikidata entities.
Let's take a bridge and a cemetery. What is their "smallest" common superclass?
A bridge is a subclass of "architectural structure".
A cemetery is a subclass of "place of worship", which is a subclass of "architectural structure".
---> Their most specialized common class is "architectural structure".
This SPARQL query is close to the solution:
SELECT ?classe ?classeLabel WHERE {
wd:Q12280 wdt:P279* ?classe .
FILTER EXISTS { wd:Q39614 wdt:P279* ?classe .}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
Problem: it returns all common classes between both items, not just the first ones. How could I filter the answer to get what I want?
If this question is of interest to someone else, here is the SPARQL query I finally use to get the least common subsumers of more than two items. It is a mix of @AKSW's response in the comments and the answer to that previous question on SO.
SELECT ?lcs ?lcsLabel WHERE {
?lcs ^wdt:P279* wd:Q32815, wd:Q34627, wd:Q16970, wd:Q16560 .
filter not exists {
?sublcs ^wdt:P279* wd:Q32815, wd:Q34627, wd:Q16970, wd:Q16560 ;
wdt:P279 ?lcs .
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
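As a sketch, the same pattern applied to the two items from the question (wd:Q12280 and wd:Q39614, the bridge and the cemetery) would look like this. The result has not been verified here and depends on the current state of the class hierarchy.

```sparql
SELECT ?lcs ?lcsLabel WHERE {
  # ?lcs is a (transitive) superclass of both items
  ?lcs ^wdt:P279* wd:Q12280, wd:Q39614 .
  # ...and no direct subclass of ?lcs is itself a common superclass
  FILTER NOT EXISTS {
    ?sublcs ^wdt:P279* wd:Q12280, wd:Q39614 ;
            wdt:P279 ?lcs .
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
```

The FILTER NOT EXISTS block is what turns "all common superclasses" into "the most specific ones": a candidate is dropped as soon as one of its direct subclasses also subsumes every listed item.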