How to get all Wikipedia page links with their pageIDs? - wikipedia-api

A request like this:
https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Title&prop=links&pllimit=500
returns a list of the links that the page contains, where each link consists of the title and the ns (namespace).
Is there a way to also get the page ID together with the title and ns? (The less work it is for the server, the better, of course.)

You need to use the generator parameter. Here is an example for the Cobra Wikipedia page:
https://en.wikipedia.org/w/api.php?action=query&generator=links&titles=Cobra&prop=info&gpllimit=500
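A minimal Python sketch of how that generator request could be built and its JSON response read. The helper names build_links_url and links_with_ids are made up for illustration, format=json is added so the result can be decoded, and the sample response is a hand-written stand-in, not real API output. Linked pages that do not exist come back without a pageid, so the sketch reads that key with .get:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_links_url(title, limit=500):
    """Build the generator=links request shown above, asking for JSON."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "links",
        "titles": title,
        "prop": "info",
        "gpllimit": limit,
    }
    return API + "?" + urlencode(params)


def links_with_ids(response):
    """Return (pageid, ns, title) tuples, sorted by title.

    Pages that do not exist carry no "pageid" key, so they get None.
    """
    pages = response.get("query", {}).get("pages", {})
    rows = [(p.get("pageid"), p["ns"], p["title"]) for p in pages.values()]
    return sorted(rows, key=lambda row: row[2])


# Hand-written stand-in for a decoded response (not real API output).
sample = {"query": {"pages": {
    "-1": {"ns": 0, "title": "No such page", "missing": ""},
    "123": {"pageid": 123, "ns": 0, "title": "Example"},
}}}

print(build_links_url("Cobra"))
print(links_with_ids(sample))
```

To use it against the live API, fetch the built URL with any HTTP client and pass the decoded JSON to links_with_ids. A page with more links than the limit needs follow-up requests, which this sketch does not handle.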

Related

How to get table info and a summary of a page using the Wikipedia API?

I want to get minimal information about a Wikipedia page using the MediaWiki API, like DuckDuckGo does. For example, for Steve Carell: https://duckduckgo.com/?q=steve+carell&t=hp&ia=news&iax=about
How can I get this information in HTML format, given a Wikipedia URL (e.g. https://en.wikipedia.org/wiki/Steve_Carell)?
You can use the MediaWiki API for that. There's an extension, TextExtracts, which is exactly for that (and it is installed on Wikipedia).
In your case, e.g.:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsentences=1&titles=Steve%20Carell
will return something like:
<p class=\"mw-empty-elt\">\n</p>\n\n<p class=\"mw-empty-elt\">\n \n</p>\n<p><b>Steven John Carell</b> (<span></span>; born August 16, 1962) is an American actor, comedian, producer, writer and director.</p>
You can also customize how many sentences (or characters) the API returns; please consult the API documentation for that.
There's also a way to retrieve the short description, which is stored at Wikidata (and visible in the mobile view of Wikipedia). This call would be:
https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&titles=Steve_Carell
This returns the following property in the pageprops of the page:
"wikibase-shortdesc": "American actor"
This may fit better depending on your use case.
You can even get both of the results with a single, combined, request:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts|pageprops&exsentences=1&titles=Steve_Carell
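Here is a rough Python sketch of the combined request and of pulling both values out of the result. The function names are hypothetical, format=json is added so the response can be decoded, and the sample response is hand-assembled from the values quoted above (with a placeholder page key), so treat it as an illustration rather than real output:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_summary_url(title, sentences=1):
    """Build the combined extracts|pageprops request, asking for JSON."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts|pageprops",
        "exsentences": sentences,
        "titles": title,
    }
    return API + "?" + urlencode(params)


def summarize(response):
    """Map each page title to its extract and Wikidata short description."""
    result = {}
    for page in response["query"]["pages"].values():
        result[page["title"]] = {
            "extract": page.get("extract"),
            "shortdesc": page.get("pageprops", {}).get("wikibase-shortdesc"),
        }
    return result


# Hand-assembled from the values quoted above; "PAGEID" is a placeholder key.
sample = {"query": {"pages": {"PAGEID": {
    "title": "Steve Carell",
    "extract": "<p><b>Steven John Carell</b> ... is an American actor.</p>",
    "pageprops": {"wikibase-shortdesc": "American actor"},
}}}}

print(build_summary_url("Steve Carell"))
print(summarize(sample))
```

Both lookups use .get, so a page that lacks an extract or a short description yields None instead of raising.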

How can I get page id, wikidata id of some title along with multiple languages in a single API call?

I have been trying to call Wikipedia API to retrieve page id and wikidata item id using below call and it works fine.
https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=wikibase_item&redirects=1&format=xml&titles=Cat
But I need to retrieve the same information for other languages of my choice. For example, if I mention German and French in my call, it should look up their translations of the word Cat and retrieve their page info. There is a langlinks property in the Wikipedia API, but somehow it doesn't work with the query action together with pageprops.
So ideally, I want something like this:
https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&ppprop=wikibase_item&prop=langlinks&lllang=de&lllang=fr&titles=Cat
Any help would be appreciated.
Using lllang twice just results in the second value overwriting the first one. You'll have to omit the parameter, and then you get all the links:
https://en.wikipedia.org/w/api.php?action=query&prop=pageprops|langlinks&ppprop=wikibase_item&titles=Cat
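Since the request then returns every language, the ones you care about can be picked out on the client side. A small Python sketch of that idea, under a few assumptions: the function names are invented, lllimit=max is my own addition (as far as I know the default limit is lower than the number of languages a popular article has), and the sample response uses placeholder values instead of real titles or IDs:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_langlinks_url(title):
    """Build the pageprops|langlinks request without lllang, asking for JSON."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "pageprops|langlinks",
        "ppprop": "wikibase_item",
        "lllimit": "max",  # assumption: added so the list is not cut short
        "redirects": 1,
        "titles": title,
    }
    return API + "?" + urlencode(params)


def pick_languages(response, wanted):
    """Keep only the language links whose language code is in `wanted`."""
    out = {}
    for page in response["query"]["pages"].values():
        links = {
            link["lang"]: link["*"]
            for link in page.get("langlinks", [])
            if link["lang"] in wanted
        }
        out[page["title"]] = {
            "pageid": page.get("pageid"),
            "wikibase_item": page.get("pageprops", {}).get("wikibase_item"),
            "langlinks": links,
        }
    return out


# Placeholder values only; not real page IDs, item IDs, or titles.
sample = {"query": {"pages": {"1": {
    "pageid": 1, "ns": 0, "title": "Cat",
    "pageprops": {"wikibase_item": "Q0"},
    "langlinks": [
        {"lang": "de", "*": "DE-TITLE"},
        {"lang": "es", "*": "ES-TITLE"},
        {"lang": "fr", "*": "FR-TITLE"},
    ],
}}}}

print(pick_languages(sample, {"de", "fr"}))
```

Note that langlinks only gives the title on the other wiki; the page ID there would need a separate request to that language's API.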

Section content using MediaWiki API

I'm using the MediaWiki API to get the content of a Wikipedia page like this in JSON.
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=New_York&prop=extracts
I'd like each section to be separated out instead of having the entire content of the page as one value. I know you can get the list of sections like this, but I want it to also include the content of each section.
http://en.wikipedia.org/w/api.php?format=json&action=parse&prop=sections&page=New_York
Is this possible to do with the API?
If you know the number of the section you want, you can get its contents through action=parse with the section parameter. E.g. the "19th century" section of the New_York article would be:
https://en.wikipedia.org/w/api.php?action=parse&page=New_York&format=json&prop=wikitext&section=4
To get the section number you can use
http://en.wikipedia.org/w/api.php?format=json&action=parse&prop=sections&page=New_York
and then find the index corresponding to your section title (line). In this case "line":"19th century","index":"4".
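The lookup can be sketched in Python as follows. The function names are made up for this example, and the sample response is a trimmed, hand-written version of a prop=sections result that keeps only the "line" and "index" fields mentioned above:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def section_index(response, heading):
    """Return the index of the section whose heading equals `heading`."""
    for section in response["parse"]["sections"]:
        if section["line"] == heading:
            return section["index"]
    return None  # no section with that heading


def build_section_url(page, index):
    """Build the action=parse request for one section's wikitext."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": "wikitext",
        "section": index,
    }
    return API + "?" + urlencode(params)


# Trimmed, hand-written stand-in for a prop=sections response.
sample = {"parse": {"title": "New York", "sections": [
    {"line": "History", "index": "1"},
    {"line": "19th century", "index": "4"},
]}}

index = section_index(sample, "19th century")
print(build_section_url("New_York", index))
```

To get every section with its content, one option is to loop over all entries in the sections list and issue one such request per index.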

Query Wikipedia pages with properties

I need to use Wikipedia API Query or any other api such as Opensearch to query for a simple list of pages with some properties.
Input: a list of page (article) titles or ids.
Output: a list of pages that contain the following properties each:
page id
title
snippet/description (like in opensearch api)
page url
image url (like in opensearch api)
A result similar to this:
http://en.wikipedia.org/w/api.php?action=opensearch&search=miles%20davis&limit=20&format=xml
Except that it should include page IDs and, instead of running a search, take an exact list of pages given by either titles or pageids.
This should be a fairly simple thing but I have been stuck with that for quite some time trying all kinds of URL combinations from the MW api manual, without success.
I don't think there is another way than the OpenSearch API to fetch OpenSearch data, but depending on which Wikipedia you are interested in, there might be other extensions installed to help you. Taking English Wikipedia as an example, we can make use of the MobileFrontend and PageImages extensions, which happen to be installed there.
Title and url are available from the native MediaWiki API. To get the url, you can use prop=info, and specify with inprop=url that it is the url you are interested in.
Prominent images of a page are returned by prop=pageimages, thanks to PageImages.
MobileFrontend adds a property called extracts (on current Wikipedia it is provided by the TextExtracts extension), which you can use with the directive exintro to get the introductory text before the first section. Note, however, that MediaWiki markup is complex, and the result might not always be perfect. If we put it all together in one single query, it would be something like this:
http://en.wikipedia.org/w/api.php?action=query&pageids=21482&prop=pageimages|info|extracts&inprop=url&exintro
giving this:
<api>
<query>
<pages>
<page pageid="21482" ns="0" title="Nairobi" pageimage="Nairobi_Montage.jpg" contentmodel="wikitext" pagelanguage="en" touched="2014-02-06T06:10:01Z" lastrevid="594161616" counter="" length="89157" fullurl="http://en.wikipedia.org/wiki/Nairobi" editurl="http://en.wikipedia.org/w/index.php?title=Nairobi&action=edit">
<thumbnail source="http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Nairobi_Montage.jpg/45px-Nairobi_Montage.jpg" width="45" height="50" />
<extract xml:space="preserve">
<p><b>Nairobi</b> /naɪˈroʊbi/ is the [...]
</extract>
</page>
</pages>
</query>
</api>
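For reference, the same query can be sketched in Python with JSON output instead of XML. The names build_pages_url and page_cards are hypothetical, and the sample dictionary is a hand-made JSON-shaped version of the XML result above, limited to the fields the question asks for:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_pages_url(pageids):
    """Build the pageimages|info|extracts query for a list of page IDs."""
    params = {
        "action": "query",
        "format": "json",
        "pageids": "|".join(str(i) for i in pageids),
        "prop": "pageimages|info|extracts",
        "inprop": "url",
        "exintro": 1,  # boolean flag: being present is what matters
    }
    return API + "?" + urlencode(params)


def page_cards(response):
    """Collect id, title, URL, image URL and extract for each page."""
    cards = []
    for page in response["query"]["pages"].values():
        cards.append({
            "pageid": page.get("pageid"),
            "title": page.get("title"),
            "url": page.get("fullurl"),
            "image": page.get("thumbnail", {}).get("source"),
            "extract": page.get("extract"),
        })
    return cards


# Hand-made JSON-shaped version of the XML result above.
sample = {"query": {"pages": {"21482": {
    "pageid": 21482, "ns": 0, "title": "Nairobi",
    "fullurl": "http://en.wikipedia.org/wiki/Nairobi",
    "thumbnail": {
        "source": "http://upload.wikimedia.org/wikipedia/commons/thumb/6/66/Nairobi_Montage.jpg/45px-Nairobi_Montage.jpg",
        "width": 45, "height": 50,
    },
    "extract": "<p><b>Nairobi</b> ...</p>",
}}}}

print(build_pages_url([21482]))
print(page_cards(sample)[0]["url"])
```

Pages without a lead image simply get None for "image", because the thumbnail key is read with a default.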
Here is a multistep process for getting a list of Wikipedia article titles and properties, and then getting the page IDs and URLs of those articles.
Please note: It does use a portion of a previous answer: "Title and url are available from the native MediaWiki API. To get the url, you can use prop=info, and specify with inprop=url that it is the url you are interested in."
If you would like to use the Wikipedia API in your own application to search Wikipedia for a list of articles about a certain topic, and you want the answer in JSON format, then you could use the following URL:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=REPLACE_ME_WITH_SEARCH_TOPIC&format=json&callback=?
If the raw results are hard to read, replace "format=json&callback=?" with "formatversion=2", as in the following example:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=REPLACE_ME_WITH_SEARCH_TOPIC&formatversion=2
The following example returns a batch of article titles and properties for "Thailand" in JSON format; after that, I use the resulting titles to find the page IDs and URLs of those articles.
URL step 1:
https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=thailand&format=json&callback=?
Step 1 returns JSON that contains the list of titles I need. In step 2, I pass those titles to a second API query to get the page IDs and URLs of those articles.
Here are the Wikipedia article titles from the resulting JSON of step 1:
Thailand
Outline of Thailand
Geography of Thailand
Economy of Thailand
Football in Thailand
Southern Thailand
Government of Thailand
Northern Thailand
Culture of Thailand
Cinema of Thailand
URL step 2:
https://en.wikipedia.org/w/api.php?action=query&titles=Thailand|Outline%20of%20Thailand|Geography%20of%20Thailand|Economy%20of%20Thailand|Football%20in%20Thailand|Southern%20Thailand|Government%20of%20Thailand|Northern%20Thailand|Culture%20of%20Thailand|Cinema%20of%20Thailand&prop=info&inprop=url&format=json&callback=?
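The two steps can be chained in code. Below is a minimal Python sketch, with invented function names and a shortened, hand-written sample of a step 1 response. It leaves out callback=?, which as far as I can tell is only a JSONP placeholder for browser-side libraries:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_search_url(topic):
    """Step 1: search for articles about a topic."""
    params = {
        "action": "query",
        "format": "json",
        "list": "search",
        "srsearch": topic,
    }
    return API + "?" + urlencode(params)


def titles_from_search(response):
    """Pull the article titles out of a decoded step 1 response."""
    return [hit["title"] for hit in response["query"]["search"]]


def build_info_url(titles):
    """Step 2: ask for page IDs and URLs of the given titles."""
    params = {
        "action": "query",
        "format": "json",
        "titles": "|".join(titles),
        "prop": "info",
        "inprop": "url",
    }
    return API + "?" + urlencode(params)


# Shortened, hand-written stand-in for a step 1 response.
step1 = {"query": {"search": [
    {"ns": 0, "title": "Thailand"},
    {"ns": 0, "title": "Outline of Thailand"},
]}}

titles = titles_from_search(step1)
print(build_info_url(titles))
```

The API caps how many titles one request may carry (50 for ordinary clients, as far as I know), so a longer list would have to be split into batches.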

How to get the result of "all pages with prefix" using Wikipedia api?

I wish to use Wikipedia api to extract the result of this page:
http://en.wikipedia.org/wiki/Special:PrefixIndex
When searching "something" on it, for example this:
http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4
Then, I would like to access each of the resulting pages and extract their information.
What api call might I use?
You can use list=allpages and specify apprefix. For example:
http://en.wikipedia.org/w/api.php?format=xml&action=query&list=allpages&apprefix=tal&aplimit=max
This query will give you the id and title of each article that starts with tal. If you want to get more information about each page, you can use this list as a generator:
http://en.wikipedia.org/w/api.php?format=xml&action=query&generator=allpages&gapprefix=tal&gaplimit=max&prop=info
You can give different values to the prop parameter to get different information about the page.
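As a small illustration, here is how the generator URL could be assembled in Python. The function name is hypothetical, and the optional namespace argument is my own addition (it maps to gapnamespace, to mirror the namespace=4 in the question's PrefixIndex link); it is not part of the query above:

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def build_prefix_url(prefix, namespace=None, prop="info"):
    """Build a generator=allpages query for titles starting with `prefix`."""
    params = {
        "action": "query",
        "format": "json",
        "generator": "allpages",
        "gapprefix": prefix,
        "gaplimit": "max",
        "prop": prop,
    }
    if namespace is not None:
        # Assumption: restrict to one namespace, like namespace=4 above.
        params["gapnamespace"] = namespace
    return API + "?" + urlencode(params)


print(build_prefix_url("tal"))
print(build_prefix_url("tal", namespace=4, prop="info|categories"))
```

Passing several values joined with | as prop, as in the second call, requests more than one kind of information per page in the same query.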