How can I get the Infobox from a Wikipedia article by the MediaWiki API? [duplicate] - wikipedia-api

This question already has answers here:
How to get the Infobox data from Wikipedia?
(8 answers)
Closed 3 years ago.
Wikipedia articles may have Infobox templates. By the following call I can get the first section of an article which includes an Infobox.
http://en.wikipedia.org/w/api.php?action=parse&pageid=568801&section=0&prop=wikitext
I want a query which will return only Infobox data. Is this possible?

You can do it with a URL call to the Wikipedia API like this:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0
Replace the titles= section with your page title, and format=xmlfm to format=json if you want the article in JSON format.

Instead of parsing infoboxes yourself, which is quite complicated, take a look at DBPedia, which has Wikipedia infoboxes extracted out as database objects.

Building on garry's answer, you can have Wikipedia parse the info box into HTML for you via the rvparse parameter like so:
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Scary%20Monsters%20and%20Nice%20Sprites&rvsection=0&rvparse
Note that neither method will return just the info box. But from the HTML content, you can extract (via, e.g., Beautiful Soup) the table with class infobox.
In Python, you do something like the following
resp = requests.get(url).json()
page_one = next(iter(resp['query']['pages'].values()))
revisions = page_one.get('revisions', [])
html = next(iter(revisions[0].values()))
# Now parse the HTML

If the page has a right side infobox, then use this URL to obtain it in txt form.
My example is using the element hydrogen. All you need to do is replace "Hydrogen" with your title.
https://en.wikipedia.org/w/index.php?action=raw&title=Template:Infobox%20hydrogen
If you are looking for JSON format use this URL, but it's not pretty.
https://en.wikipedia.org/w/api.php?action=parse&page=Template:Infobox%20hydrogen&format=json

Related

how to get table info and summary of page using Wikipedia api?

I want to get minimal information of a Wikipedia page using MediaWiki API like DuckDuckGo. For example for Steve Carell: https://duckduckgo.com/?q=steve+carell&t=hp&ia=news&iax=about
How can I get this information with a Wikipedia url (eg https://en.wikipedia.org/wiki/Steve_Carell) in HTML format?
You can use the MediaWiki API for that. There's an extension, TextExtracts, which is exactly for that (and it is installed on Wikipedia).
In your case, e.g.:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exsentences=1&titles=Steve%20Carell
will return something like:
<p class=\"mw-empty-elt\">\n</p>\n\n<p class=\"mw-empty-elt\">\n \n</p>\n<p><b>Steven John Carell</b> (<span></span>; born August 16, 1962) is an American actor, comedian, producer, writer and director.</p>
You can customize how many sentences (or characters) the API returns, as well, please consult the API documentation for that.
There's also the way to retrieve the short description, which is saved at Wikidata (and visible in the mobile view of Wikipedia). This call would be:
https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&titles=Steve_Carell
This returns the following property in the pageprops of the page:
"wikibase-shortdesc": "American actor"
This may fit better depending on your use case.
You can even get both of the results with a single, combined, request:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts|pageprops&exsentences=1&titles=Steve_Carell

Karate Automation: Is there any way we can set the Scenario name dynamically from a json file [duplicate]

This question already has answers here:
Can we parameterize the request file name to the Read method in Karate?
(2 answers)
Closed 1 year ago.
I am using a JSON file which act as a test case document for my API testing. The JSON contain Test Case ID, Test case Description, Header and Request body details, which should be the driving factor of Automation
Currently i am looping a feature over this json file to set different header and body validations. However it will be helpful if i can set the Scenario name from JSON file while its iterating
Something like
serverpost.feature
Feature:re-usable feature to publish data
Scenario: TC_NAME # TC_NAME is avaliable in the JSON data passed to this feature. However, CURRENTLY ITS NOT TAKING THIS DATA FROM JSON FILE.
Given path TC_ID # TC ID is taken from JSON
Given url 'http://myappurl.com:8080/mytestapp/Servers/Data/uploadServer/'
And request { some: '#(BODY)' } # Request Body Details is taken from JSON
Please suggest
In my honest opinion, you are asking for a very un-necessary feature. Please refer to the demo examples, look for it in the documentation.
Specifically, look at this one: dynamic-params.feature. There are multiple ways to create / use a data table. Instead of trying to maintain 2 files - think of Karate as being both - your data table AND the test execution. There is no need to complicate things further.
If you really really want to re-use some JSON lying around, it is up to you but you won't be able to update the scenario name, sorry. What I suggest is just use the print statement to dump the name to the log and it will appear in the HTML report (refer to the doc). Note that when calling a feature in a loop using a JSON array, the call argument is ALREADY included the report, so you may not need to do anything.
Just an observation - your questions seem to be very basic, do you mind reading the doc and the examples a bit more thoroughly, thanks.

How do I access the "See Also" Field in the Wiktionary API?

Many of the Wiktionary pages for Chinese Characters (Hanzi) include links at the top of the page to other similar-looking characters. I'd like to use the Wiktionary API to send a single character in the query and receive a list of similar characters as the response. Unfortunately, I can't seem to find any query that includes the "See Also" field. Is this kind of query possible?
The “see also” field is just a line of wiki code in the page source, and there is no way for the API to know that it's different from any other piece of text on the page.
If you are happy with using only the English version of Wiktionary, you can fetch the wikicode: index.php?title=太&action=raw, and then parse the result for the template also. In this case, the line you are looking for is {{also|大|犬}}.
To check if the template is used on the page at all, query the API for titles=太&prop=templates&tltemplates=Template:also
Similar templates are avilable in more language editions of Wiktionary, in case you want to use other sources than the English one. The current list is:
br:Patrom:gwelet
ca:Plantilla:vegeu
cs:Šablona:Viz
de:Vorlage:Siehe auch
el:Πρότυπο:δείτε
es:Plantilla:desambiguación
eu:Txantiloi:Esanahi desberdina
fi:Malline:katso
fr:Modèle:voir
gl:Modelo:homo
id:Templat:lihat
is:Snið:sjá einnig
it:Template:Vedi
ja:テンプレート:see
no:Mal:se også
oc:Modèl:veire
pl:Szablon:podobne
pt:Predefinição:ver também
ru:Шаблон:Cf
sk:Šablóna:See
sv:Mall:se även
It has been suggested that the WikiData project be expanded to cover Wiktionary. If and when that happens, you might be able to query theWikiData API for that kind of stuff!

citeseerx search api

Is there a way to access CiteSeerX programmatically (e.g. search by author and/or title?) Surprisingly I cannot find anything relevant; surely others too are trying to get scholarly article metadata without resorting to scraping?
EDIT: note that CiteSeerX supports OAI PMH, but that seems to be an API geared towards digital libraries keeping up to date with each other ("content dissemination") and does not specifically support search. Moreover the citeseer info on that page is very sparse and even says "Currently, there are difficulties with the OAI".
There is another SO question about CiteSeerX API (though not specifically search); the 2 answers do not resolve the problem (one talks about Mendeley, another piece of software, and the other says OAI-PMH implementations are free to offer extensions to the minimal spec).
Alternatively, can anyone suggest a good way to obtain citations from authors/titles programmatically?
As suggested by one of the commenters, I tried jabref first:
jabref -n -f "citeseer:title:(lessons from) author:(Brewer)"
However jabref seems to not realize that the query string needs to contain colons and so throws an error.
For search results, I ended up scraping the CiteSeerX results with Python's BeautifulSoup:
url = "http://citeseerx.ist.psu.edu/search?q="
q = "title%3A%28{1}%29+author%3%28{0}%29&submit=Search&sort=cite&t=doc"
url += q.format (author_last, title.replace (" ", "+"))
soup = BeautifulSoup (urllib2.urlopen (url).read ())
result = soup.html.body ("div", id = "result_list") [0].div
title = result.h3.a.string.strip ()
authors = result ("span", "authors") [0].string
authors = authors [len ("by "):].strip ()
date = result ("span", "pubyear") [0].string.strip (", ")
It is possible to get a document ID from the results (the misleadingly-named "doi=..." part in the summary link URL) and then pass that to the CiteSeerX OAI engine to get Dublin Core XML (e.g. http://citeseerx.ist.psu.edu/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:CiteSeerX.psu:10.1.1.42.2177); however that XML ends up containing multiple dc:date elements, which makes it less useful than the scrape output.
Too bad CiteSeerX makes people resort to scraping in spite of all the open archives / open access rhetoric.

How to get the result of "all pages with prefix" using Wikipedia api?

I wish to use Wikipedia api to extract the result of this page:
http://en.wikipedia.org/wiki/Special:PrefixIndex
When searching "something" on it, for example this:
http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4
Then, I would like to access each of the resulting pages and extract their information.
What api call might I use?
You can use list=allpages and specify apprefix. For example:
http://en.wikipedia.org/w/api.php?format=xml&action=query&list=allpages&apprefix=tal&aplimit=max
This query will give you the id and title of each article that starts with tal. If you want to get more information about each page, you can use this list as a generator:
http://en.wikipedia.org/w/api.php?format=xml&action=query&generator=allpages&gapprefix=tal&gaplimit=max&prop=info
You can give different values to the prop parameter to get different information about the page.