Query Wikidata for Wikipedia URLs in multiple languages - sparql

I'm trying to use Wikidata as an intermediary to get from a unique identifier listed in Wikidata (for example VIAF ID) to a Wikipedia description.
I've managed to piece together this query to get the Wikipedia page ID from a given VIAF ID ("153672966" below is the VIAF ID for "Southern Illinois University Press"):
SELECT ?pageid WHERE {
?item wdt:P214 "153672966".
[ schema:about ?item ; schema:name ?name ;
schema:isPartOf <https://en.wikipedia.org/> ]
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:endpoint "en.wikipedia.org" .
bd:serviceParam wikibase:api "Generator" .
bd:serviceParam mwapi:generator "allpages" .
bd:serviceParam mwapi:gapfrom ?name .
bd:serviceParam mwapi:gapto ?name .
?pageid wikibase:apiOutput "#pageid" .
}
}
This results in the pageid 9393762, which I am able to look up in the Wikipedia API to get the introduction text I need using this request:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&pageids=9393762
The resulting response includes a plain-text description (explaintext) taken from the first section of the Wikipedia article, so this gets me where I need to be, provided the language is English.
Now the problem is that I need to use this on an internationalized site where I might not even know upfront which languages might be used in the future. The query against Wikidata is supposed to run as a batch job on the backend, while fetching the actual descriptions from Wikipedia will be done from the frontend and be rendered asynchronously.
Ideally I would want the Wikidata query to return a pageid for each language in which a Wikipedia article is available. On the frontend I would then check whether the currently active language has a pageid associated and either call the Wikipedia API or render a fallback if no pageid is given.
In the future I would need to make similar queries with other library related identifiers such as ISNI for example, but I don't imagine that being much different than the current use-case.
Is this a reasonable way to get the job done and how can I expand it to support multiple languages?

To get the explaintext you don't necessarily need the pageid; the page title is enough.
You can get the page titles in all languages from Wikidata with the following query:
SELECT ?item ?title ?site WHERE {
?item wdt:P214 "153672966" .
[ schema:about ?item ; schema:name ?title ;
schema:isPartOf ?site ] .
}
And afterwards you can use Wikipedia API to get the explaintext:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles=Southern%20Illinois%20University%20Press
The downside of working with page titles is that they are not stable, so you will need to run your batch job regularly to pick up renamed articles.
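As a rough sketch of the backend step, the rows returned by the query above could be turned into one extracts URL per Wikipedia edition. The function name and the choice to key the result by subdomain are assumptions made for illustration, not part of the original answer. The input is assumed to be the results.bindings list of the standard SPARQL JSON result format. Note that a few subdomains (for example simple) are not plain language codes.

```python
from urllib.parse import quote, urlencode, urlsplit


def extract_urls_by_subdomain(bindings):
    """Map a Wikipedia subdomain (e.g. 'en', 'de') to an extracts API URL.

    `bindings` is assumed to hold rows with ?site and ?title, as produced
    by the SPARQL query above (hypothetical helper, for illustration only).
    """
    urls = {}
    for row in bindings:
        site = row["site"]["value"]    # e.g. "https://en.wikipedia.org/"
        title = row["title"]["value"]
        host = urlsplit(site).hostname or ""
        if not host.endswith(".wikipedia.org"):
            continue  # skip Wikiquote, Commons and other sister projects
        params = {
            "format": "json",
            "action": "query",
            "prop": "extracts",
            "exintro": 1,
            "explaintext": 1,
            "redirects": 1,
            "titles": title,
        }
        # quote_via=quote encodes spaces as %20 instead of "+"
        query = urlencode(params, quote_via=quote)
        urls[host.split(".")[0]] = f"https://{host}/w/api.php?{query}"
    return urls
```

The frontend can then look up the active language in the resulting mapping and render its fallback when the key is missing.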

Related

Wikidata query to select items with both Simple Wikipedia & English Wikipedia Pages

I am trying to get Wikidata entries that have both a Simple Wikipedia page, and an English Wikipedia page, and have both of the links. So far I have:
SELECT ?item ?itemLabel ?articleSimple ?article
WHERE {
?articleSimple schema:about ?item ;
schema:isPartOf <https://simple.wikipedia.org/> ;
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]" }
}
LIMIT 10
Right now, this successfully selects the Simple Wikipedia pages, but I'm not able to get the regular Wikipedia page.
I've tried with an "OPTIONAL" block, or another "schema:about", but have not been successful. Thank you for any help provided!

Can I use generators in Wikidata's API?

I am trying to get all of the pages in a given category from Wikipedia, including the ones in subcategories. That part is no problem, but I also want certain fields from each page, like birth date.
From this topic I suppose I need to use https://wikidata.org/w/api.php and not, for example, https://pl.wikipedia.org/...
I assumed I should use a generator, but my trouble is that when calling Wikidata I get an error about a bad ID, which I don't get for Wikipedia.
query.params = {
"action": "query", // placeholder for test
"generator": "categorymembers",
"gcmpageid": 1810130, // sophists' category at pl.wikipedia
"format": "json"
}
https://pl.wikipedia.org/w/api.php -> data
https://en.wikipedia.org/w/api.php -> error: nosuchpage (expected)
https://www.wikidata.org/w/api.php -> error: invalidcategory (why???)
I've tried to use the ID from Wikidata prefixed with "Q", but then I got badinteger.
Alternatively, I could make requests to Wikipedia for the IDs and then to Wikidata, but that means calling twice for the same thing and passing all those IDs into the second request...
Please help
TL;DR Using generators from Polish Wikipedia in Wikidata API does not work but other solutions exist.
A few things to note about Wikidata and its API:
Wikidata doesn't know anything about the category hierarchies on Polish Wikipedia (or on any other Wikipedia language version)
There is no API to query pages in all subcategories. This is mainly because the category system of MediaWiki allows cycles in the category hierarchies and arbitrarily deep nesting of categories.
pageIds are only unique within a project. So using a pageId from pl.wikipedia.org does not work on https://en.wikipedia.org/w/api.php or https://www.wikidata.org/w/api.php
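Because category hierarchies can contain cycles, any walk over subcategories needs to remember which categories it has already visited and should cap its depth. A minimal sketch of that idea follows; fetch_members is a hypothetical callback (not a MediaWiki function) that is assumed to return the pages and subcategories of one category, for example by wrapping a categorymembers request.

```python
from collections import deque


def collect_pages(root, fetch_members, max_depth=5):
    """Collect page titles from `root` and its subcategories.

    `fetch_members(category)` is a hypothetical callback assumed to return
    a pair (page_titles, subcategory_titles) for a single category.
    """
    seen = {root}              # categories already queued, guards against cycles
    pages = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        member_pages, subcategories = fetch_members(category)
        pages.update(member_pages)
        if depth >= max_depth:
            continue           # do not descend any further
        for sub in subcategories:
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return pages
```

The max_depth default of 5 is an arbitrary choice for the sketch; pick whatever limit suits the category tree you are walking.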
There are multiple solutions to your problem:
Use the query in your question recursively to get all page titles from Kategoria:Sofiści and its subcategories.
Afterwards, use the Wikidata API to retrieve the Wikidata item for each Polish Wikipedia article: e.g. for Protagoras the query is this: https://www.wikidata.org/w/api.php?action=wbgetentities&sites=plwiki&titles=Protagoras&props=claims&format=json
This returns a JSON document with all statements about Protagoras stored on Wikidata. The date of birth can be found in it under claims->P569->mainsnak->datavalue->value->time.
Use the Wikidata Query Service. It allows you to call out to the MediaWiki API from SPARQL.
SELECT ?item ?itemLabel ?date_of_birth WHERE {
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:api "Generator" .
bd:serviceParam wikibase:endpoint "pl.wikipedia.org" .
bd:serviceParam mwapi:gcmtitle 'Kategoria:Sofiści' .
bd:serviceParam mwapi:generator "categorymembers" .
bd:serviceParam mwapi:gcmprop "ids|title|type" .
bd:serviceParam mwapi:gcmlimit "max" .
?item wikibase:apiOutputItem mwapi:item .
}
?item wdt:P569 ?date_of_birth
SERVICE wikibase:label { bd:serviceParam wikibase:language "pl". }
}
Paste this query into https://query.wikidata.org/. That page also offers code examples showing how to access the results programmatically.
The drawback of this solution is that pages in subcategories are not included.
Fully rely on Wikidata. Use the following query in https://query.wikidata.org/:
SELECT ?item ?itemLabel ?date_of_birth WHERE {
?item wdt:P106 wd:Q3750514.
?item wdt:P569 ?date_of_birth
SERVICE wikibase:label { bd:serviceParam wikibase:language "pl,en". }
}
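For the wbgetentities route described above, the path claims->P569->mainsnak->datavalue->value->time can be walked roughly as follows. This is a sketch that assumes the response has already been decoded from JSON into Python dicts; the function name is made up for illustration. Statements whose value is "unknown" or "no value" carry no datavalue, so they are skipped.

```python
def birth_dates(response):
    """Return {entity_id: [time strings]} from a decoded wbgetentities response.

    Hypothetical helper: it only follows the path named in the answer and
    ignores statement ranks, qualifiers and date precision.
    """
    dates = {}
    for entity_id, entity in response.get("entities", {}).items():
        times = []
        for claim in entity.get("claims", {}).get("P569", []):
            datavalue = claim.get("mainsnak", {}).get("datavalue")
            if datavalue is None:
                continue  # "unknown value" or "no value" statement
            times.append(datavalue["value"]["time"])
        dates[entity_id] = times
    return dates
```

An entity without a date of birth ends up with an empty list, which makes it easy to tell "no date recorded" apart from "entity not requested".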

Filter search in Wikidata API

I am trying to get filtered data from the Wikidata API. Currently I can do a general search using this API, but there are now specific cases where I have to filter the results. For example, I need a list of only authors so that I can get their Q identifiers. I also looked at the Wikidata Query Service, but it is too heavy for fetching all the items: in a test with a SPARQL query, getting fewer than 3000 results took 26 seconds, which is too slow for a search service.
This is the query I use to get the authors.
SELECT DISTINCT ?author ?authorLabel WITH {
SELECT ?item ?author WHERE {
?item wdt:P50 ?author.
} LIMIT 100000
} AS %FOO {
INCLUDE %FOO
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
I also need to search by categories, but I have not been able to filter the searches in any way. Does anyone know a way to do it?

How to construct SPARQL query for a list of Wikidata items

First off, I'm not a developer, and I'm new to writing SPARQL queries. Mostly I've been looking up existing queries and trying to tweak them to get what I need. The issue is that most documentation on query construction has to do with getting new data you don't have, rather than retrieving or extending existing data. And when you do find tips for retrieving existing data, they tend to be for ONE item at a time instead of a full data set of many items.
I mostly use OpenRefine for this. I started by loading up my existing list of names, and used the Wikidata extension service to reconcile the names to existing Wikidata IDs. So now, this is where I am, vs. where I want to go:
1 - We have a list of Wikidata IDs for reconciled matches;
2 - We have used OpenRefine to get most of the data we need from those;
3 - We don't have the label, description, or Wikipedia links (English), which are extremely valuable;
4 - I have figured out how to construct a query for the label and description of just ONE Wikidata Item:
SELECT ?itemLabel ?itemDescription WHERE {
VALUES ?item { wd:Q15485689 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
5 - I have figured out how to construct a query to extract the Wikipedia English URL for just ONE Wikidata item:
SELECT ?article ?lang ?name WHERE {
?article schema:about wd:Q15485689;
schema:inLanguage ?lang;
schema:name ?name;
schema:isPartOf _:b13.
_:b13 wikibase:wikiGroup "wikipedia".
FILTER(?lang IN("en"))
FILTER(!(CONTAINS(?name, ":")))
OPTIONAL { ?article wdt:P31 ?instance_of. }
}
The questions are:
How do I modify either query to generate these same results for MORE THAN ONE* Wikidata item?
How do I modify the query to give me all three at once, for more than one* Wikidata item?
*we have 667, but I could do smaller batches if that's too much for the service to handle
Ideally, the query would generate something that allowed me to download a CSV file looking much like this (so I can match on and import the new data into our Airtable base which feeds the website application):
[image: ideal CSV output]
If anyone can lead me in the right direction here, I'd appreciate it.
I should also note that if OpenRefine has a way of retrieving these I'm all ears! But since these three don't have a property code, I couldn't see how to snag them from OR.
Something like the query below should work. See how many QIDs you can get away with in the VALUES statement; probably all of them in one go. This query gives you the URL and the article title; you can drop the article title column if you do not want it. Note also https://www.wikidata.org/wiki/Wikidata:Request_a_query, which is Wikidata's own venue for questions such as these.
SELECT ?item ?itemLabel ?itemDescription ?sitelink ?article
WHERE
{
VALUES ?item {wd:Q105848230 wd:Q6697407 wd:Q2344502 wd:Q1698206}
OPTIONAL {
?article schema:about ?item ;
schema:isPartOf <https://en.wikipedia.org/> ;
schema:name ?sitelink .
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
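If the list of IDs comes from a spreadsheet or an OpenRefine export, the VALUES line can be generated instead of typed. The sketch below fills the query above from a Python list and optionally splits it into batches. The helper name and the batch size of 200 are assumptions for illustration; the service may well accept all 667 items in one query.

```python
import re

# A Wikidata item ID is "Q" followed by a positive integer.
QID = re.compile(r"Q[1-9][0-9]*")

QUERY_TEMPLATE = """SELECT ?item ?itemLabel ?itemDescription ?sitelink ?article
WHERE
{{
  VALUES ?item {{ {values} }}
  OPTIONAL {{
    ?article schema:about ?item ;
             schema:isPartOf <https://en.wikipedia.org/> ;
             schema:name ?sitelink .
  }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}"""


def build_queries(qids, batch_size=200):
    """Return one SPARQL query string per batch of item IDs."""
    for qid in qids:
        if not QID.fullmatch(qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    queries = []
    for start in range(0, len(qids), batch_size):
        batch = qids[start:start + batch_size]
        values = " ".join(f"wd:{qid}" for qid in batch)
        queries.append(QUERY_TEMPLATE.format(values=values))
    return queries
```

Each returned string can be pasted into https://query.wikidata.org/ and the result downloaded as CSV from there.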
Yes, a VALUES statement in SPARQL can relay not only hundreds but even thousands of items. I regularly do this when cross-checking to see how Wikidata matches up to an existing data set. Some other things you could do as well that take lists of Wikidata items:
Petscan - https://petscan.wmflabs.org/
TABernacle - https://tabernacle.toolforge.org/

Wikipedia API and SPARQL in a single query

I need to search for Wikipedia pages that contain some specific words in their full text. To improve the results I want to limit the results to pages describing entities that are instances of a specific entity.
For searching the full text I can use the Wikipedia APIs, using the query action and the search generator.
For filtering instances of a given entity I can use the Wikidata APIs and a SPARQL query.
Is there a way to execute both operations in a single query that applies both filters?
Since June 2017, it is possible to call out to Wikimedia APIs from Wikidata SPARQL:
SELECT ?wikidata_item ?wikipedia_title {
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:endpoint "en.wikipedia.org" .
bd:serviceParam wikibase:api "Generator" .
bd:serviceParam mwapi:generator "search" .
bd:serviceParam mwapi:gsrsearch "triplestore" .
bd:serviceParam mwapi:gsrlimit "max" .
?wikidata_item wikibase:apiOutputItem mwapi:item .
?wikipedia_title wikibase:apiOutput mwapi:title .
}
# hint:Prior hint:runFirst "true".
?wikidata_item wdt:P31 wd:Q3539533 .
FILTER (?wikipedia_title != "Blazegraph")
}
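To reuse the query above with other search terms or classes, the two varying parts can be filled in from code. The following is only a sketch: the helper names are invented here, the optional FILTER line is left out, and the search term is escaped so that quotes or backslashes in it cannot break out of the SPARQL string literal.

```python
import re

QID = re.compile(r"Q[1-9][0-9]*")

QUERY_TEMPLATE = """SELECT ?wikidata_item ?wikipedia_title {{
  SERVICE wikibase:mwapi {{
    bd:serviceParam wikibase:endpoint "en.wikipedia.org" .
    bd:serviceParam wikibase:api "Generator" .
    bd:serviceParam mwapi:generator "search" .
    bd:serviceParam mwapi:gsrsearch {search} .
    bd:serviceParam mwapi:gsrlimit "max" .
    ?wikidata_item wikibase:apiOutputItem mwapi:item .
    ?wikipedia_title wikibase:apiOutput mwapi:title .
  }}
  ?wikidata_item wdt:P31 wd:{class_qid} .
}}"""


def sparql_string(text):
    """Quote `text` as a double-quoted SPARQL string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def build_search_query(search_term, class_qid):
    """Full-text search on English Wikipedia, limited to instances of a class."""
    if not QID.fullmatch(class_qid):
        raise ValueError(f"not a Wikidata item ID: {class_qid!r}")
    return QUERY_TEMPLATE.format(
        search=sparql_string(search_term), class_qid=class_qid
    )
```

Calling build_search_query("triplestore", "Q3539533") reproduces the query from the answer, minus the FILTER.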
No, those have completely separate search backends that do not interact. The Wikidata API uses SQL queries; the search API uses Elasticsearch; the SPARQL service uses Blazegraph.