Filter search in Wikidata API - sparql

I am trying to get filtered data from the Wikidata API. Currently I can do a general search with this API, but there are now specific cases where I have to filter the results. For example, I need a list of only authors, so that I can get their Q identifiers. I also looked at the Wikidata Query Service, but it is too heavy for fetching all the items: in a test with a SPARQL query, getting fewer than 3000 results took 26 seconds, which is too much for a search service.
This is the query I use to get the authors.
SELECT DISTINCT ?author ?authorLabel WITH {
SELECT ?item ?author WHERE {
?item wdt:P50 ?author.
} LIMIT 100000
} AS %FOO {
INCLUDE %FOO
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
I also need to search by category, but I have not found any way to filter the searches. Does anyone know a way to do it?
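One option that might fit a search service is the `haswbstatement` keyword of the Wikidata full-text search, which restricts hits to items carrying a given statement. Below is a minimal Python sketch, assuming the standard `action=query&list=search` module on www.wikidata.org. The function names are hypothetical, and the property/value pair is a parameter (the example uses P31 = Q5, instance of human). Note that `haswbstatement` matches the exact statement only and does not follow subclasses.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_filtered_search_url(text, prop, value, limit=10):
    """Build a full-text search URL restricted to items with one statement."""
    params = {
        "action": "query",
        "list": "search",
        # haswbstatement:P=Q keeps only items that carry exactly this statement.
        "srsearch": f"{text} haswbstatement:{prop}={value}",
        "srlimit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def titles_from_search(response):
    """Return the page titles (the Q identifiers for items) of the search hits."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]
```

For example, `build_filtered_search_url("Tolkien", "P31", "Q5")` yields a URL whose decoded JSON response can be passed to `titles_from_search`.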

Related

Can I use generators in WikiData's API?

I am trying to get all of the pages in a given category from Wikipedia, including the ones in subcategories. No problem with that, but I also want certain fields for each page, like birth date.
From this topic I suppose I need to use https://wikidata.org/w/api.php and not, for example, https://pl.wikipedia.org/...
I assumed I should use a generator, but my trouble is that when calling Wikidata I get an error about a bad ID, which I don't get for Wikipedia.
query.params = {
"action": "query", // placeholder for test
"generator": "categorymembers",
"gcmpageid": 1810130, // sophists' category at pl.wikipedia
"format": "json"
}
https://pl.wikipedia.org/w/api.php -> data
https://en.wikipedia.org/w/api.php -> error: nosuchpage (expected)
https://www.wikidata.org/w/api.php -> error: invalidcategory (why???)
I've tried using that ID prefixed with "Q" on Wikidata, but then I got badinteger.
Alternatively, I could request the IDs from Wikipedia and then query Wikidata, but that means calling twice for the same thing and passing all those IDs into the second request...
Please help
TL;DR: Using generators from Polish Wikipedia in the Wikidata API does not work, but other solutions exist.
A few things to note about Wikidata and its API:
Wikidata doesn't know anything about the category hierarchies on Polish Wikipedia (or on any other Wikipedia language version).
There is no API to query pages in all subcategories. This is mainly because the category system of MediaWiki allows cycles in the category hierarchy and arbitrarily deep nesting of categories.
Page IDs are only unique within a project, so a page ID from pl.wikipedia.org does not work on https://en.wikipedia.org/w/api.php or https://www.wikidata.org/w/api.php.
There are multiple solutions to your problem:
Use the query in your question recursively to get all page titles from Kategoria:Sofiści and its subcategories.
Afterwards, use the Wikidata API to retrieve the Wikidata item for each Polish Wikipedia article: e.g. for Protagoras the query is this: https://www.wikidata.org/w/api.php?action=wbgetentities&sites=plwiki&titles=Protagoras&props=claims&format=json
This returns a JSON document with all statements about Protagoras stored on Wikidata. You find the birth date in that document under claims->P569->mainsnak->datavalue->value->time.
Use the Wikidata Query Service. It allows you to call out to the MediaWiki API from SPARQL.
SELECT ?item ?itemLabel ?date_of_birth WHERE {
SERVICE wikibase:mwapi {
bd:serviceParam wikibase:api "Generator" .
bd:serviceParam wikibase:endpoint "pl.wikipedia.org" .
bd:serviceParam mwapi:gcmtitle 'Kategoria:Sofiści' .
bd:serviceParam mwapi:generator "categorymembers" .
bd:serviceParam mwapi:gcmprop "ids|title|type" .
bd:serviceParam mwapi:gcmlimit "max" .
?item wikibase:apiOutputItem mwapi:item .
}
?item wdt:P569 ?date_of_birth
SERVICE wikibase:label { bd:serviceParam wikibase:language "pl". }
}
Paste this query into https://query.wikidata.org/. That page also offers code examples showing how to access the results programmatically.
The drawback of this solution is that pages in subcategories are not included.
Fully rely on Wikidata. Use the following query in https://query.wikidata.org/:
SELECT ?item ?itemLabel ?date_of_birth WHERE {
?item wdt:P106 wd:Q3750514.
?item wdt:P569 ?date_of_birth
SERVICE wikibase:label { bd:serviceParam wikibase:language "pl,en". }
}
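The first solution above (walk the category tree on Polish Wikipedia, then read P569 from the Wikidata entity) can be sketched in Python as follows. This is only a sketch under stated assumptions: it uses `list=categorymembers` rather than the generator form from the question, all function names are hypothetical, the HTTP call is left to a `fetch_members` callable that you supply, and `cmcontinue` paging is not handled. The `seen` set is what protects against the category cycles mentioned above.

```python
from urllib.parse import urlencode


def category_members_url(endpoint, category_title):
    """Build a list=categorymembers request for one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category_title,
        "cmprop": "ids|title|type",
        "cmlimit": "max",
        "format": "json",
    }
    return f"{endpoint}?{urlencode(params)}"


def walk_category(fetch_members, root):
    """Collect page titles under root, including all subcategories.

    fetch_members(category_title) must return the list stored under
    query -> categorymembers in the API response for that category.
    """
    seen = {root}  # categories already queued; guards against cycles
    queue = [root]
    pages = set()
    while queue:
        current = queue.pop()
        for member in fetch_members(current):
            if member["type"] == "subcat":
                if member["title"] not in seen:
                    seen.add(member["title"])
                    queue.append(member["title"])
            elif member["type"] == "page":
                pages.add(member["title"])
    return sorted(pages)


def birth_times(entity):
    """Read claims -> P569 -> mainsnak -> datavalue -> value -> time."""
    times = []
    for claim in entity.get("claims", {}).get("P569", []):
        datavalue = claim["mainsnak"].get("datavalue")
        if datavalue:  # "unknown value" snaks carry no datavalue
            times.append(datavalue["value"]["time"])
    return times
```

Each title returned by `walk_category` would then go into a `wbgetentities` call like the Protagoras one, and the matching entry under `entities` in that response is what `birth_times` expects.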

How to construct SPARQL query for a list of Wikidata items

First off, I'm not a developer, and I'm new to writing SPARQL queries. Mostly I've been looking up existing queries and trying to tweak them to get what I need. The issue is that most documentation on query construction has to do with getting new data you don't have, rather than retrieving or extending existing data. And when you do find tips for retrieving existing data, they tend to be for ONE item at a time instead of a full data set of many items.
I mostly use OpenRefine for this. I start by loading up my existing list of names, and use the Wikidata extension service to reconcile the names to existing Wikidata IDs. So now, this is where I am, vs. where I want to go:
1 - We have a list of Wikidata IDs for reconciled matches;
2 - We have used OpenRefine to get most of the data we need from those;
3 - We don't have the label, description, or Wikipedia links (English), which are extremely valuable;
4 - I have figured out how to construct a query for the label and description of just ONE Wikidata Item:
SELECT ?itemLabel ?itemDescription WHERE {
VALUES ?item { wd:Q15485689 }
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
5 - I have figured out how to construct a query to extract the Wikipedia English URL for just ONE Wikidata item:
SELECT ?article ?lang ?name WHERE {
?article schema:about wd:Q15485689;
schema:inLanguage ?lang;
schema:name ?name;
schema:isPartOf _:b13.
_:b13 wikibase:wikiGroup "wikipedia".
FILTER(?lang IN("en"))
FILTER(!(CONTAINS(?name, ":")))
OPTIONAL { ?article wdt:P31 ?instance_of. }
}
The questions are:
How do I modify either query to generate these same results for MORE THAN ONE* Wikidata item?
How do I modify the query to give me all three at once, for more than one* Wikidata item?
*we have 667, but I could do smaller batches if that's too much for the service to handle
Ideally, the query would generate something that allowed me to download a CSV file looking much like this (so I can match on and import the new data into our Airtable base which feeds the website application):
[image: ideal CSV output]
If anyone can lead me in the right direction here, I'd appreciate it.
I should also note that if OpenRefine has a way of retrieving these I'm all ears! But since these three don't have a property code, I couldn't see how to snag them from OR.
Something like this. See how many QIDs you can get away with in the VALUES statement; probably all of them in one go. This query gives you the URL and the article title; you can drop the article title column if you do not want it. Note also https://www.wikidata.org/wiki/Wikidata:Request_a_query, which is Wikidata's own place for questions such as these.
SELECT ?item ?itemLabel ?itemDescription ?sitelink ?article
WHERE
{
VALUES ?item {wd:Q105848230 wd:Q6697407 wd:Q2344502 wd:Q1698206}
OPTIONAL {
?article schema:about ?item ;
schema:isPartOf <https://en.wikipedia.org/> ;
schema:name ?sitelink .
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
Yes, a VALUES statement in SPARQL can relay not only hundreds but even thousands of items. I regularly do this when cross-checking to see how Wikidata matches up to an existing data set. Some other things you could do as well that take lists of Wikidata items:
Petscan - https://petscan.wmflabs.org/
TABernacle - https://tabernacle.toolforge.org/
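For a long list of IDs, the VALUES query above can be generated and its results flattened to CSV with a few lines of Python. This is a rough sketch: the function names and the batching helper are hypothetical, and sending the query to the query service is left out. It assumes the results arrive in the standard SPARQL JSON format, where the rows sit under results -> bindings and each cell has a `value` key.

```python
import csv
import io


def build_values_query(qids):
    """Build the label/description/sitelink query for a list of Q IDs."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?itemLabel ?itemDescription ?sitelink ?article WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  OPTIONAL {\n"
        "    ?article schema:about ?item ;\n"
        "             schema:isPartOf <https://en.wikipedia.org/> ;\n"
        "             schema:name ?sitelink .\n"
        "  }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def bindings_to_csv(bindings, columns):
    """Flatten SPARQL JSON result bindings into CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in bindings:
        # Unbound OPTIONAL variables are simply absent from the row.
        writer.writerow([row.get(col, {}).get("value", "") for col in columns])
    return out.getvalue()
```

With 667 IDs, one batch is probably enough. `batches(ids, 200)` is only there in case the service times out on the full list.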

How to get a MediaWiki SPARQL query to return the text of an article?

For example, I need to get the names of all people in Wikipedia and the text of their pages (parsed or not, it's not important).
I wrote this SPARQL query:
SELECT ?human ?humanLabel WHERE {
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
?human wdt:P31 wd:Q5.
}
LIMIT 10
How can I get the full text of the articles as an additional column in this query?
You can't. The Wikidata Query Service only returns data stored in Wikidata, not the text of Wikipedia articles. The best solution is to run your query first, then loop over the results and call the following API for each record to get the page text.
https://en.wikipedia.org/w/api.php?action=query&format=json&prop=revisions&titles=Barack_Obama&utf8=1&rvprop=content
Replace Barack_Obama with the page title.
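The per-record API call could look roughly like this in Python. It is a sketch with hypothetical function names. Compared with the URL above, it additionally assumes `rvslots=main` and `formatversion=2`, which change the response layout to the one parsed here. The HTTP request itself is left out.

```python
from urllib.parse import urlencode


def page_text_url(title, endpoint="https://en.wikipedia.org/w/api.php"):
    """Build the revisions request that returns the wikitext of one page."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",  # assumption: makes "pages" a list
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",  # assumption: content is read from the main slot
        "titles": title,
    }
    return f"{endpoint}?{urlencode(params)}"


def extract_wikitext(response):
    """Return {title: wikitext} for the pages that exist in the response."""
    texts = {}
    for page in response.get("query", {}).get("pages", []):
        revisions = page.get("revisions")
        if revisions:  # missing pages have no revisions
            texts[page["title"]] = revisions[0]["slots"]["main"]["content"]
    return texts
```

Loop over the titles from the SPARQL results, fetch `page_text_url(title)` for each, and pass the decoded JSON to `extract_wikitext`.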

Efficient filter query in Wikidata

I am trying to form an efficient filter query in SPARQL on Wikidata. Let me explain my process:
I query the search-entities API using keywords, e.g. (Apple, Orange).
The API query returns a list of relevant item IDs, e.g. (wd:Q629269, wd:Q154950, wd:Q312, wd:Q95, wd:Q4878289, wd:Q10817602).
With this list of IDs, I then query SPARQL to return the items that are an instance of a certain class or of one of its subclasses, e.g. (p:P31/ps:P31/wdt:P279* wd:Q43229), which returns every item that is an organisation or a subclass thereof.
Then, for the items in the list of IDs that are of that class, return additional information if it exists, e.g. (OPTIONAL).
I am new to SPARQL. My question is: is this the most efficient method to achieve this output? It seems quite inefficient to me, and I cannot find a similar type of problem in the tutorial examples.
You can try the query here.
SELECT distinct ?item ?itemLabel ?itemDescription ?web ?inception ?ISIN
WHERE{
FILTER (?item IN (wd:Q629269, wd:Q154950, wd:Q312, wd:Q95, wd:Q4878289, wd:Q10817602))
?item p:P31/ps:P31/wdt:P279* wd:Q43229.
OPTIONAL {
?item wdt:P856 ?web. # get item-web
?item wdt:P571 ?inception. # get item-inception
?item wdt:P946 ?ISIN. # get item-isin
}
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }
}
LIMIT 10
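The process described above (search hits, then a class filter, then optional details) could be glued together like this in Python. It is a sketch, not a measured optimisation, and the function names are hypothetical. Two things differ from the query above and are assumptions on my part: the ID list is passed with VALUES instead of FILTER ... IN, which is a commonly suggested form, and each property gets its own OPTIONAL so that an item missing one of the three still returns the other two.

```python
def ids_from_search(response):
    """Extract the item IDs from a wbsearchentities JSON response."""
    return [hit["id"] for hit in response.get("search", [])]


def build_class_filter_query(qids, class_qid, limit=10):
    """Build a query keeping only the given items that belong to class_qid."""
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT DISTINCT ?item ?itemLabel ?itemDescription ?web ?inception ?ISIN WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        f"  ?item p:P31/ps:P31/wdt:P279* wd:{class_qid} .\n"
        "  OPTIONAL { ?item wdt:P856 ?web. }\n"
        "  OPTIONAL { ?item wdt:P571 ?inception. }\n"
        "  OPTIONAL { ?item wdt:P946 ?ISIN. }\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }\n'
        "}\n"
        f"LIMIT {limit}"
    )
```

Usage would be `build_class_filter_query(ids_from_search(search_response), "Q43229")`, with the resulting string sent to the query service.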

Apparently simple Wikidata SPARQL filter query is slow

Absolute Wikidata and SPARQL beginner here. I am trying to find out the Q code of a particular female name, say Jennifer. I can get it with a query like this:
SELECT ?name WHERE {
?name wdt:P31 wd:Q11879590.
?name rdfs:label ?label.
FILTER((STR(?label)) = "Jennifer")
SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 1
That is, I look up entities that are instance of "female given name" and then filter to those with "Jennifer" in the label. It works, but it takes 5 s or more.
If I omit the LIMIT 1 I get many instances of the same results, which signals to me that I am doing something stupid.
Bottom line, is there an efficient way to look up the Q code for a "female given name"?
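A note on the duplicates: every item has one rdfs:label per language, and STR() drops the language tag, so each language in which the label is spelled "Jennifer" produces its own row. One thing to try is matching the language-tagged literal directly instead of filtering, which typically lets the engine look the label up rather than scan all given names. Here is a hedged Python sketch that builds such a query. The function name is hypothetical and the speed-up is an expectation, not a measurement.

```python
def build_given_name_query(name, language="en", class_qid="Q11879590"):
    """Build a query that looks up a name item by its exact tagged label."""
    # Escape backslashes and double quotes so the name stays a valid
    # SPARQL string literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?name WHERE {\n"
        f'  ?name rdfs:label "{escaped}"@{language} .\n'
        f"  ?name wdt:P31 wd:{class_qid} .\n"
        "}"
    )
```

`build_given_name_query("Jennifer")` produces a query with the triple `?name rdfs:label "Jennifer"@en .` in place of the FILTER. The label service is left out because no label column is selected.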