Wikidata query duplicates

Sorry if my English is bad, but I don't really have any place where I can ask this question in my native language.
I've been trying to write a SPARQL query for Wikidata that lists all horror fiction published between 1925 and 1950, with the names of the authors and, if available, pictures:
SELECT DISTINCT ?item ?itemLabel ?author ?name ?creation ?picture
WHERE
{
  ?item wdt:P136 wd:Q193606 .   # genre (the horror item used in this question)
  ?item wdt:P50 ?author .       # author
  ?item wdt:P577 ?creation .    # publication date, checked against the lower bound
  ?item wdt:P577 ?end .         # publication date again, checked against the upper bound
  ?author rdfs:label ?name .    # author's label
  OPTIONAL { ?item wdt:P18 ?picture }   # image, if there is one
  FILTER (?creation >= "1925-01-01T00:00:00Z"^^xsd:dateTime) .
  FILTER (?end <= "1950-12-31T23:59:59Z"^^xsd:dateTime) .
  SERVICE wikibase:label
  {
    bd:serviceParam wikibase:language "en" .
  }
}
However, for some reason this query puts duplicates in the list, and DISTINCT doesn't help much. After some time I figured out that the cause is "?author rdfs:label ?name .". If this line is removed, no duplicates are listed. But I need this line to show the author's name in the list!
Any ideas on how to fix this?

You don't need ?author rdfs:label ?name . because SERVICE wikibase:label already supplies labels: just as you get the item's label as ?itemLabel, you can select ?authorLabel to get the author's name.
Beyond that, you will get repeated rows for every item that has several values for a SELECTed property: here you are SELECTing authors (P50), so an item with several authors appears once per author.

The query is actually giving you distinct results. The problem is that some items have multiple rdfs:label values, typically one per language. As an example, look at this item:
SELECT *
WHERE
{
wd:Q2882840 rdfs:label ?label
SERVICE wikibase:label
{
bd:serviceParam wikibase:language "en" .
}
}
And since there are multiple rdfs:label predicates for some items, they are showing up in separate rows.
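
To make the row multiplication concrete, here is a minimal Python sketch of how such a pattern is matched. It is only a toy model: the IDs and labels are hypothetical, not Wikidata data. Each matching label produces its own solution row, and because the language tag is part of the literal, DISTINCT cannot merge the rows even when the text is identical.

```python
# Toy model of SPARQL pattern matching. All IDs and labels are hypothetical.
triples = [
    ("Q_BOOK", "P50", "Q_AUTHOR"),
    ("Q_AUTHOR", "rdfs:label", ("Jane Doe", "en")),
    ("Q_AUTHOR", "rdfs:label", ("Jane Doe", "fr")),
    ("Q_AUTHOR", "rdfs:label", ("Джейн Доу", "ru")),
]


def objects(subject, predicate):
    """Return every object of the triples matching (subject, predicate, ?o)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_rows(lang=None):
    """Join `?item P50 ?author` with `?author rdfs:label ?name`.

    With lang=None every label matches; otherwise only labels in that language.
    """
    rows = []
    for item, predicate, author in triples:
        if predicate != "P50":
            continue
        for text, label_lang in objects(author, "rdfs:label"):
            if lang is None or label_lang == lang:
                rows.append((item, author, text, label_lang))
    return rows


all_rows = author_rows()
english_rows = author_rows(lang="en")

print(len(all_rows))        # one row per label
print(len(set(all_rows)))   # deduplicating does not reduce the count
print(english_rows)
```

Restricting the label to one language, which in SPARQL is written as FILTER (lang(?name) = "en"), is what brings the result back to a single row per author in this model.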

You can aggregate your results according to the book title (the item's label) using the GROUP BY keyword.
Each group then shows up once in the results, and fields that have several different values within a group are concatenated using the separator (in this case, a comma).
The fixed query:
SELECT DISTINCT ?item ?itemLabel
  # each alias must differ from the variables already used in WHERE
  (GROUP_CONCAT(DISTINCT ?author; separator=",") AS ?authors)
  (GROUP_CONCAT(DISTINCT ?name; separator=",") AS ?names)
  (GROUP_CONCAT(DISTINCT ?creation; separator=",") AS ?creations)
  (GROUP_CONCAT(DISTINCT ?picture; separator=",") AS ?pictures)
WHERE
{
  ?item wdt:P136 wd:Q193606 .   # genre (the horror item used in the question)
  ?item wdt:P50 ?author .       # author
  ?item wdt:P577 ?creation .
  ?item wdt:P577 ?end .
  # note: ?name matches labels in every language;
  # adding FILTER (lang(?name) = "en") would keep only the English ones
  ?author rdfs:label ?name .
  OPTIONAL { ?item wdt:P18 ?picture }
  FILTER (?creation >= "1925-01-01T00:00:00Z"^^xsd:dateTime) .
  FILTER (?end <= "1950-12-31T23:59:59Z"^^xsd:dateTime) .
  SERVICE wikibase:label
  {
    bd:serviceParam wikibase:language "en" .
  }
}
GROUP BY ?item ?itemLabel
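
For readers who want to see what the grouping does to the rows, the following Python sketch imitates GROUP BY with GROUP_CONCAT(DISTINCT ...) on a handful of made-up result rows. The function name and the data are assumptions for illustration only. One difference to keep in mind: this sketch keeps values in first-seen order, whereas SPARQL does not promise any particular order inside a GROUP_CONCAT.

```python
from collections import defaultdict

# Hypothetical ungrouped result rows: (item, itemLabel, author name).
rows = [
    ("Q1", "Book A", "Author X"),
    ("Q1", "Book A", "Author Y"),
    ("Q1", "Book A", "Author X"),   # repeated binding, e.g. caused by another join
    ("Q2", "Book B", "Author Z"),
]


def group_concat_distinct(rows, separator=","):
    """Group rows by (item, label) and join the distinct names of each group."""
    groups = defaultdict(list)
    for item, label, name in rows:
        values = groups[(item, label)]
        if name not in values:      # the DISTINCT part
            values.append(name)
    return {key: separator.join(values) for key, values in groups.items()}


grouped = group_concat_distinct(rows)
for (item, label), names in grouped.items():
    print(item, label, names)
```

Four input rows collapse into two output rows, one per book, which is the effect the grouped query is after.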

Related

How can I group multiple results into one cell with SPARQL in Wikidata

I'm trying to pull (lots of) data for one of my projects.
Specifically, I'm trying to get some data on biblical figures.
However, I've noticed that when there are multiple results per column, each result comes back in a new row. In other words, there seems to be no option to put multiple results in one row, with a separator for example.
For example, since some biblical figures have more than one sibling, I get the results in multiple rows:
Here's an example for a query with siblings
I tried to group by but got an error:
select ?person ?personLabel ?siblingLabel (GROUP_CONCAT(?personLabel) AS ?personLabels)
where {
?person wdt:P31 wd:Q20643955.
?person wdt:P3373 ?sibling.
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
}
}
GROUP BY ?person
ORDER BY ?personLabel

If you want to have all siblings in a cell, you have to use GROUP_CONCAT on ?siblingLabel, not ?personLabel. To omit duplicate labels, you can add DISTINCT to it. To use a delimiter (e.g., a semicolon), you can add SEPARATOR to it.
(GROUP_CONCAT(DISTINCT ?siblingLabel; SEPARATOR="; ") AS ?siblingLabels)
To the GROUP BY you have to add all other variables.
As you are getting the labels with Wikidata’s label service, one more step is needed: You either have to use a sub-query, or you have to list the labels you need in the SERVICE.
Using the latter, your query could be:
SELECT ?person ?personLabel (GROUP_CONCAT(DISTINCT ?siblingLabel; SEPARATOR="; ") AS ?siblingLabels)
WHERE {
?person wdt:P31 wd:Q20643955 ;
wdt:P3373 ?sibling .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
?sibling rdfs:label ?siblingLabel .
?person rdfs:label ?personLabel .
}
}
GROUP BY ?person ?personLabel
ORDER BY ?personLabel

Wikidata SPARQL - Duplicate results for spouse start time and end time

I am trying to construct a query that returns a list of actors and their spouses, including marriage and divorce dates for each couple. So I would expect to see each actor repeated once for each relationship. However, when I try to include the start time and end time properties in the query, I get duplicate results. I suspect this is because the "name" of the spouse is stored under a different Wikidata prefix and I'm not grouping them correctly.
Here is a sample query:
SELECT ?person ?personLabel ?spouse ?spouseLabel ?starttime ?endtime
WHERE
{
?person wdt:P106 wd:Q33999, wd:Q2526255, wd:Q28389, wd:Q3282637;
wdt:P26 ?spouse.
?person p:P26 [pq:P580 ?starttime; pq:P582 ?endtime].
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ASC(UCASE(str(?personLabel)))
LIMIT 10
Here is a link to the SPARQL query service so you can see the duplicated results I'm referring to:
https://query.wikidata.org/#SELECT%20%3Fperson%20%3FpersonLabel%20%3Fspouse%20%3FspouseLabel%20%3Fstarttime%20%3Fendtime%0AWHERE%0A%7B%0A%20%20%3Fperson%20wdt%3AP106%20wd%3AQ33999%2C%20wd%3AQ2526255%2C%20wd%3AQ28389%2C%20wd%3AQ3282637%3B%0A%20%20%20%20%20%20%20%20%20%20wdt%3AP26%20%3Fspouse.%0A%20%20%3Fperson%20p%3AP26%20%5Bpq%3AP580%20%3Fstarttime%3B%20pq%3AP582%20%3Fendtime%5D.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%7D%0AORDER%20BY%20ASC%28UCASE%28str%28%3FpersonLabel%29%29%29%0ALIMIT%2010%0A

The problem with your query is that there is no link between the spouse and the statement about their marriage.
So for every actor, you are returning all their spouses, and also all the start/end dates of their marriages, regardless of whether they relate to the specific spouse.
What you need to do is to use the ps: namespace, like so:
SELECT ?person ?personLabel ?spouse ?spouseLabel ?starttime ?endtime
WHERE
{
?person wdt:P106 wd:Q33999, wd:Q2526255, wd:Q28389, wd:Q3282637 .
?person p:P26 [ ps:P26 ?spouse ; #This is the necessary change.
pq:P580 ?starttime;
pq:P582 ?endtime ].
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ASC(?personLabel)
LIMIT 10
In general, the wdt: namespace is for linking entities directly, the p: namespace links an entity to a statement, ps: links a statement to an entity, and pq: tells us something about the statement.
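
The difference between the two queries can be sketched in Python with a toy model of statement nodes. Everything here is hypothetical (the IDs, the years, the dictionary layout), and it ignores statement ranks, which wdt: takes into account on the real service. The unlinked version pairs every spouse with every pair of dates, while the linked version reads the spouse and the dates from the same statement.

```python
# Hypothetical data: one person with two marriage statements.
statements = [
    {"subject": "Q_ACTOR", "property": "P26", "value": "Q_SPOUSE_1",
     "qualifiers": {"P580": "1990", "P582": "1995"}},
    {"subject": "Q_ACTOR", "property": "P26", "value": "Q_SPOUSE_2",
     "qualifiers": {"P580": "2000", "P582": "2010"}},
]


def marriages(person):
    """All P26 statements of one person."""
    return [s for s in statements
            if s["subject"] == person and s["property"] == "P26"]


def unlinked_rows(person):
    """Imitates `wdt:P26 ?spouse` plus a separate `p:P26 [pq:P580 ..; pq:P582 ..]`.

    Spouses and dates are matched independently, so they form a cross product.
    """
    spouses = [s["value"] for s in marriages(person)]
    dates = [(s["qualifiers"]["P580"], s["qualifiers"]["P582"])
             for s in marriages(person)]
    return [(spouse, start, end) for spouse in spouses for start, end in dates]


def linked_rows(person):
    """Imitates `p:P26 [ps:P26 ?spouse; pq:P580 ..; pq:P582 ..]`.

    Spouse and dates come from the same statement node.
    """
    return [(s["value"], s["qualifiers"]["P580"], s["qualifiers"]["P582"])
            for s in marriages(person)]


print(len(unlinked_rows("Q_ACTOR")))   # 2 spouses x 2 date pairs
print(linked_rows("Q_ACTOR"))
```

In this model the unlinked join yields four rows for two marriages, two of which attach the wrong dates to a spouse, and the linked join yields exactly one row per marriage.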

Why this wikidata SPARQL query is missing country information?

This SPARQL query on Wikidata is missing the form of government for a lot of entries. My query:
SELECT DISTINCT ?country ?countryLabel
(group_concat(DISTINCT ?bfogLabel;separator=", ") as ?Government)
WHERE
{
?country wdt:P31 wd:Q3624078.
OPTIONAL {?country wdt:P122 ?bfog } . # basic form of government
SERVICE wikibase:label
{ bd:serviceParam wikibase:language "en" .
?country rdfs:label ?countryLabel .
?bfog rdfs:label ?bfogLabel .
}
}
GROUP BY ?country ?countryLabel
ORDER BY ?countryLabel
For example, Wikipedia's infobox for Angola says "Unitary dominant-party presidential constitutional republic", but the government column is empty for Angola in this query.
Why is that? More importantly, is there any fix for this? I saw in this question that Wikidata is not as reliable as it could be when it comes to data categorization.

Retrieving data from blank nodes in Wikidata

I am attempting to retrieve data about the lifespans of certain people. This is problematic for people who lived a long time ago. The data for e.g. Pythagoras seems to have a so-called "blank node" for date of birth (P569). But this blank node references another node, earliest date (P1319), which has data I could work with just fine.
But for some reason I am not able to retrieve that node. My first try looked like this, but it returns a completely empty result set:
SELECT DISTINCT ?person ?name ?dateofbirth ?earliestdateofbirth WHERE {
  ?person wdt:P31 wd:Q5.                        # This thing is human
  ?person rdfs:label ?name.                     # Name, for easier confirmation
  ?person wdt:P569 ?dateofbirth.                # Birthday, may result in a blank node
  ?dateofbirth wdt:P1319 ?earliestdateofbirth   # Problem: earliest plausible birth date
}
I then found another syntax, ?person wdt:P569/wdt:P1319 ?earliestdateofbirth, suggested as a kind of shortcut for the explicit navigation I did above, but this also ends with an empty result set.
SELECT DISTINCT ?person ?name ?dateofbirth ?earliestdateofbirth WHERE {
  ?person wdt:P31 wd:Q5.        # Is human
  ?person rdfs:label ?name.     # Name, for easier confirmation
  ?person wdt:P569/wdt:P1319 ?earliestdateofbirth.
}
So how do I access a node referenced by a blank node (in my case specifically the earliest birthdate) in Wikidata?

"But this blank node references another node…"
Things are slightly different. The earliest date property is not a property of _:t550690019, but rather is a property of the statement wd:Q10261 wdt:P569 _:t550690019.
In the Wikidata data model, these annotations are expressed using qualifiers.
Your query should be:
SELECT DISTINCT ?person ?name ?dateofbirth ?earliestdateofbirth WHERE {
  VALUES (?person) {(wd:Q10261)}
  ?person wdt:P31 wd:Q5.        # Is human
  ?person rdfs:label ?name.     # Name, for easier confirmation
  ?person p:P569/pq:P1319 ?earliestdateofbirth.
  FILTER (lang(?name) = "en")
}
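
Why the wdt:P569/wdt:P1319 path finds nothing while p:P569/pq:P1319 succeeds can be illustrated with a small Python sketch of a sequence path walking over triples. The statement node name and the date placeholder below are made up for the example; only the shape of the graph matters. The qualifier hangs off the statement node, so a path that goes through the blank-node value never reaches it.

```python
# Toy triples; the statement node name and the date value are hypothetical.
triples = [
    ("wd:Q10261", "wdt:P569", "_:unknown"),           # direct value: a blank node
    ("wd:Q10261", "p:P569", "s:Q10261-stmt"),         # link to the statement node
    ("s:Q10261-stmt", "ps:P569", "_:unknown"),        # the statement's value
    ("s:Q10261-stmt", "pq:P1319", "earliest-date"),   # qualifier on the statement
]


def follow(start, path):
    """Evaluate a sequence path such as 'p:P569/pq:P1319' from one start node."""
    frontier = [start]
    for predicate in path.split("/"):
        frontier = [o for node in frontier
                    for s, p, o in triples
                    if s == node and p == predicate]
    return frontier


via_value = follow("wd:Q10261", "wdt:P569/wdt:P1319")
via_statement = follow("wd:Q10261", "p:P569/pq:P1319")

print(via_value)       # the blank node has no outgoing P1319 triple
print(via_statement)
```

The first path dead-ends at the blank node and returns an empty list; the second passes through the statement node and picks up the qualifier.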
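By the way, time precision (which is used when the date of birth is known) is stored differently again: it is not a qualifier but part of the statement's full value node, which is reached through psv:, as in this query: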
SELECT ?person ?personLabel ?value ?precisionLabel {
VALUES (?person) {(wd:Q859) (wd:Q9235)}
?person wdt:P31 wd:Q5 ;
p:P569/psv:P569 [ wikibase:timeValue ?value ;
wikibase:timePrecision ?precisionInteger ]
{
SELECT ?precision (xsd:integer(?precisionDecimal) AS ?precisionInteger) {
?precision wdt:P2803 ?precisionDecimal .
}
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}

OR in sparql query

This sparql query on wikidata shows all places in Germany (Q183) with a name that ends in -ow or -itz.
I want to extend this to look for places in Germany and, say, Austria.
I tried modifying the 8th line to something like:
wdt:P17 (wd:Q183 || wd:Q40);
in order to look for places in Austria (Q40), but this is not a valid query.
What is a way to extend the query to include other countries?

Afaik there is no syntax as simple as that. You can, however, use UNION to the same effect like this:
SELECT ?item ?itemLabel ?coord
WHERE
{
  ?item wdt:P31/wdt:P279* wd:Q486972 ;
        rdfs:label ?itemLabel ;
        wdt:P625 ?coord .
  { ?item wdt:P17 wd:Q183 }
  UNION
  { ?item wdt:P17 wd:Q40 }
  FILTER (lang(?itemLabel) = "de") .
  FILTER regex (?itemLabel, "(ow|itz)$") .
}
or as an alternative create a new variable containing both countries using VALUES:
SELECT ?item ?itemLabel ?coord
WHERE
{
  VALUES ?country { wd:Q40 wd:Q183 }
  ?item wdt:P31/wdt:P279* wd:Q486972 ;
        wdt:P17 ?country ;
        rdfs:label ?itemLabel ;
        wdt:P625 ?coord .
  FILTER (lang(?itemLabel) = "de") .
  FILTER regex (?itemLabel, "(ow|itz)$") .
}
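
That the two formulations select the same places can be checked with a rough Python analogy. The place list is sample data invented for the sketch (including two made-up names), and the function names are hypothetical. UNION is modelled as merging two separately matched lists, VALUES as a membership test against a set of allowed countries.

```python
import re

# Sample data for illustration: (name, country item).
# Q183 and Q40 are the country IDs used in the question; the rest is made up.
places = [
    ("Teltow", "Q183"),
    ("Chemnitz", "Q183"),
    ("Berlin", "Q183"),
    ("Examplitz", "Q40"),      # made-up name
    ("Samplow", "Q_OTHER"),    # made-up name in a country that is not requested
]

SUFFIX = re.compile(r"(ow|itz)$")   # same pattern as in the FILTER regex


def via_union(places):
    """Match each country separately, then merge the two result lists."""
    germany = [p for p in places if p[1] == "Q183"]
    austria = [p for p in places if p[1] == "Q40"]
    return [name for name, _ in germany + austria if SUFFIX.search(name)]


def via_values(places, countries=("Q40", "Q183")):
    """Bind the country to a list of allowed values and match once."""
    return [name for name, country in places
            if country in countries and SUFFIX.search(name)]


print(via_union(places))
print(via_values(places))
```

The VALUES form tends to be easier to extend, since adding a third country means adding one ID to the list instead of another UNION branch.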
