How to get information from the infobox of Wikipedia articles using the Wikipedia API?

I'm trying to get the lead actor's name from a movie's Wikipedia article.
I tried different values for prop; prop=info seemed the most relevant, but it doesn't contain the information from the article's infobox.
See:
http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Casino_Royale_(2006_film)&format=jsonfm
Is it possible to extract the infobox information using the Wikipedia API?

The MediaWiki API doesn't understand infoboxes, so you basically have two options:
Parse the infobox yourself. You can parse either the wikitext directly or the generated HTML table (both are available from the API); see the sketch after this list.
Let somebody else do the parsing. This is exactly what DBpedia does. Wikidata tries to do something similar, but it probably won't contain enough data to be usable for a long time; see its growth statistics.
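A minimal sketch of the first option in Python, assuming the requests and mwparserfromhell packages are installed; which infobox template and parameter names exist (here "starring") depends entirely on the article:

    import requests
    import mwparserfromhell

    # Fetch the raw wikitext of the article through the MediaWiki API.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "content",
            "rvslots": "main",
            "titles": "Casino Royale (2006 film)",
            "format": "json",
            "formatversion": "2",
        },
    )
    wikitext = resp.json()["query"]["pages"][0]["revisions"][0]["slots"]["main"]["content"]

    # Look for an infobox template and read its "starring" parameter.
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        if template.name.strip().lower().startswith("infobox"):
            if template.has("starring"):
                print(template.get("starring").value.strip_code().strip())
            break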

How to reconcile in OpenRefine by Wikipedia article title?

I want to reconcile a large number of records, of which I have the exact Wikipedia article titles (including parenthetical disambiguation). What is the best/fastest way to match this large number of records based on their exact Wikipedia title in OpenRefine? If I simply reconcile by text, the confidence is low and Wikidata entries with the same title get mixed up.
Transform your values into Wikipedia URLs, for instance with the following GREL formula (assuming all articles are on the English Wikipedia):
'https://en.wikipedia.org/wiki/'+value
You can then reconcile this column with the Wikidata reconciliation service, which will recognize these URLs and resolve the Wikidata items via site links.
If your article titles include disambiguation pages, reconciliation will give you disambiguation items, so it is good practice to double-check their type (P31) by fetching it after reconciliation; a sketch of such a check follows.
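A minimal Python sketch of such a check, assuming the requests package, a list of reconciled QIDs exported from OpenRefine, and that Q4167410 is the "Wikimedia disambiguation page" class:

    import requests

    def p31_values(qid):
        """Return the P31 (instance of) target QIDs for a Wikidata item."""
        resp = requests.get(
            "https://www.wikidata.org/w/api.php",
            params={
                "action": "wbgetentities",
                "ids": qid,
                "props": "claims",
                "format": "json",
            },
        )
        claims = resp.json()["entities"][qid].get("claims", {})
        return [
            c["mainsnak"]["datavalue"]["value"]["id"]
            for c in claims.get("P31", [])
            if c["mainsnak"].get("datavalue")
        ]

    # Placeholder: QIDs exported from the reconciled OpenRefine column.
    for qid in ["Q1731"]:
        if "Q4167410" in p31_values(qid):
            print(qid, "is a disambiguation item, re-check this row")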
I think you are approaching this from the opposite direction. Use the Wikidata item numbers, which are also available for disambiguation pages! The Wikidata item is linked in the left-hand side pane. It provides disambiguation, is language-neutral, and is queryable. Every Wikipedia entry has a Wikidata entry.
There might also be a SPARQL query that would do this work for you (a sketch of one follows below). If you ask some of the Wikidatans they can help; try #wikidatafacts on Twitter.
If you need non-linked text included, which might appear in some of the disambiguation page lists, the manual nature of Wikipedia won't help you. But you could spot-check for those outliers.
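For illustration, a sketch of such a SPARQL query, sent to the Wikidata Query Service from Python; the example titles are placeholders, and the query relies on the standard schema:about/schema:name sitelink pattern of the WDQS data model:

    import requests

    # Map exact English Wikipedia titles to Wikidata item IDs via sitelinks.
    query = """
    SELECT ?title ?item WHERE {
      VALUES ?title { "Dresden"@en "Casino Royale (2006 film)"@en }
      ?article schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> ;
               schema:name ?title .
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "title-to-qid-sketch/0.1"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["title"]["value"], "->", row["item"]["value"])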

Functions on Wikipedia dump file

We can use functions from the Wikipedia API to get results from Wikipedia.
For example:
import wikipedia
print(wikipedia.search("Bill", results=2))
My question: how can I use these Wikipedia API functions on a specific version of Wikipedia (e.g. just Wikipedia as of 2017)?
I doubt that this is possible. Pywikibot uses the online MediaWiki API (in this case for the site Wikipedia), which always serves the live data.
The dumps you mention are offline snapshots of Wikipedia's data (assuming you're talking about https://dumps.wikimedia.org/). This data is not connected to the MediaWiki API in any way and therefore cannot be queried with it.
What you can do to go through Wikipedia's data as of a specific time:
If it's a limited number of pages only: you could write a script which goes through the available revisions of each page and selects the one closest to the time you want (see the sketch after this list). That's probably error-prone, a lot of work, and does not really scale.
Download the dump you want to query and write a script which works on the files (e.g. the database dump or the static HTML dump, depending on what you want to do; that's not really clear from your question).
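A sketch of the first option, assuming the requests package; the page title and cut-off timestamp are placeholders, and rvstart/rvdir/rvlimit are standard parameters of prop=revisions:

    import requests

    # Find the revision that was current at a given point in time:
    # start listing at the cut-off timestamp, walk towards older revisions,
    # and take only the first hit.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "revisions",
            "rvprop": "ids|timestamp|content",
            "rvslots": "main",
            "rvlimit": 1,
            "rvstart": "2017-12-31T23:59:59Z",  # placeholder cut-off date
            "rvdir": "older",
            "titles": "Bill Gates",             # placeholder title
            "format": "json",
            "formatversion": "2",
        },
    )
    rev = resp.json()["query"]["pages"][0]["revisions"][0]
    print(rev["revid"], rev["timestamp"])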
On a dump file of a specific version we cannot use the Wikipedia API. We can only read the dump file with our own code and extract what we need from it, for example along the lines of the sketch below.
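As an illustration, a sketch that streams a pages-articles dump using only the Python standard library; the file name is a placeholder, the page/title/text element names come from the MediaWiki XML export format, and namespaces are stripped so the schema version doesn't matter:

    import bz2
    import xml.etree.ElementTree as ET

    def iter_pages(path):
        """Yield (title, wikitext) pairs from a MediaWiki pages-articles dump."""
        with bz2.open(path, "rb") as f:
            title, text = None, None
            for _event, elem in ET.iterparse(f):
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the export-schema namespace
                if tag == "title":
                    title = elem.text
                elif tag == "text":
                    text = elem.text or ""
                elif tag == "page":
                    yield title, text
                    elem.clear()  # free memory while streaming the dump

    # Placeholder file name; use whichever dump snapshot you downloaded.
    for title, text in iter_pages("wiki-20170101-pages-articles.xml.bz2"):
        print(title)
        break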

Extract related articles in different languages using Wikidata Toolkit

I'm trying to extract interlanguage links between related articles from the Wikidata dump. After searching on the internet, I found a tool named Wikidata Toolkit that helps to work with this type of data, but there is no information about how to find related articles in different languages. For example, the article "Dresden" in the English Wikipedia is related to the article "Dresda" in the Italian one; I mean the second one is the Italian version of the first one.
I tried to use the toolkit, but I couldn't find any solution.
Please show an example of how to find these related articles.
You can use the Wikidata dump [1] to get a mapping of articles across Wikipedias in multiple languages.
For example, if you look at the Wikidata entry for the respiratory system [2], at the bottom you see all the articles referring to the same topic in other languages.
That mapping (the sitelinks) is available in the Wikidata dump: just download the dump, extract the mapping, and then get the corresponding text from the Wikipedia dump; see the sketch below the references.
You might also encounter some other issues, like resolving Wikipedia redirects.
[1] https://dumps.wikimedia.org/wikidatawiki/entities/
[2] https://www.wikidata.org/wiki/Q7891
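A Python sketch of pulling that mapping out of the JSON entities dump from [1]; latest-all.json.bz2 is the conventional file name at that location, the dump is one JSON entity per line inside a single large array, and the sitelink keys (enwiki, itwiki, ...) are the Wikipedia site IDs:

    import bz2
    import json

    def iter_entities(path):
        """Yield entity dicts from a Wikidata all-entities JSON dump."""
        with bz2.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                line = line.strip().rstrip(",")
                if not line or line in ("[", "]"):
                    continue
                yield json.loads(line)

    for entity in iter_entities("latest-all.json.bz2"):
        sitelinks = entity.get("sitelinks", {})
        if "enwiki" in sitelinks and "itwiki" in sitelinks:
            # e.g. "Dresden" (enwiki) <-> "Dresda" (itwiki)
            print(sitelinks["enwiki"]["title"], "<->", sitelinks["itwiki"]["title"])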

Why is some information from the Wikipedia infobox missing on DBpedia?

For example, the star Alpha Librae has the property distance-from-earth in its infobox, but it isn't a property of the Alpha Librae DBpedia resource. On the other hand, the star Betelgeuse does have this piece of information on DBpedia. And many other stars have this distance information in the infobox, but there is no matching property in the DBpedia resource.
Is there a way to extract this missing information from DBpedia using SPARQL, or is the only way web scraping of the wiki page?
The DBpedia pages present all the data DBpedia has; no SPARQL or other query can get data that isn't there.
DBpedia is updated periodically. It may not reflect the latest changes on Wikipedia.
Also, extractors are a living project, and may not grab every property in which you're interested.
Looking at Betelgeuse on Wikipedia, I see one distance in the infobox. Looking at Alpha_Librae, I see two distances. Which should DBpedia have? Perhaps you have the niche knowledge which can ensure that the extractors do the right thing...
As JoshuaTaylor suggests, you will probably get more satisfactory answers from the DBpedia discussion list and/or the DBpedia development list.
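To check for yourself which properties DBpedia actually has for a resource, you can list them all; a minimal Python sketch against the public SPARQL endpoint, assuming the requests package (the resource URI is the one from the question):

    import requests

    # List every property/value pair DBpedia stores for Alpha Librae.
    query = """
    SELECT ?property ?value WHERE {
      <http://dbpedia.org/resource/Alpha_Librae> ?property ?value .
    }
    """

    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["property"]["value"], "=", row["value"]["value"])

    # If a distance-related property is not in this list, no SPARQL query
    # can return it: the data simply was not extracted.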
Look at en.wikipedia.org/wiki/Volkswagen_Golf_Mk3:
In the infobox you have:
height = 1991-95 & Cabrio: {{convert|1422|mm|in|1|abbr=on}}1996-99: {{convert|1428|mm|in|1|abbr=on}}
In DBpedia you get height=1991-95
instead of
height=1422
height=1428
This happens because there is no standard way to define infobox properties conditionally. For this reason, DBpedia properties are sometimes wrong or missing.

Wikipedia data extraction

I am trying to populate some tables with Hindi Wikipedia data. I have to populate them with article titles, their categories, and their corresponding English URLs.
Right now I am finding the category and English URL by parsing the HTML file and locating the particular div tag, which takes a lot of time. Is there any direct and efficient way to populate the categories? Do let me know.
I have downloaded the Hindi Wikipedia dump from: ftp://wikipedia.c3sl.ufpr.br/wikipedia/hiwiki/20131201/
You could either use some sort of parsing engine like Wikiprep: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/
Or you could use the MediaWiki engine to handle the Wiki markup language.
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
There might be other options relevant to your case; you can also check here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Help_importing_dumps_into_MySQL
(I've personally used options #1 and #2)
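If querying the live API is acceptable for your use case, the categories and the English interlanguage link can also be fetched directly per title; a minimal sketch, assuming the requests package (the example title is a placeholder, and the English URL is built by hand from the langlink title):

    import requests

    # Ask the Hindi Wikipedia API for categories and the English langlink.
    resp = requests.get(
        "https://hi.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "categories|langlinks",
            "cllimit": "max",
            "lllang": "en",
            "titles": "भारत",  # placeholder article title ("India")
            "format": "json",
            "formatversion": "2",
        },
    )
    page = resp.json()["query"]["pages"][0]
    categories = [c["title"] for c in page.get("categories", [])]
    langlinks = page.get("langlinks", [])
    english_url = None
    if langlinks:
        english_url = "https://en.wikipedia.org/wiki/" + langlinks[0]["title"].replace(" ", "_")
    print(page["title"], categories, english_url)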