I'm wondering what the easiest way would be to extract only the information contained within a certain template using the Wikimedia API.
I'd like to extract the information contained in the template "Template:Mycomorphbox" for this page: http://en.wikipedia.org/wiki/Amanita_phalloides
I'm a bit frustrated that it seems like I have to pull the entire content of the page to get the information that I need. Surely there has to be a better way.
Indeed there is a better way. You should not extract information from templates (or from wikitext in general); that is not your job, nor your application's, it's MediaWiki's.
Use Wikidata, which is where the structured information from and for Wikipedia is stored. See the Wikibase API documentation, look at some of the properties used for biology, or ask if something is unclear.
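As a rough illustration, here is a minimal sketch in Python (assuming the requests library; any HTTP client works): it first asks the Wikipedia API for the page's Wikidata item ID via prop=pageprops, then fetches that item's claims from the Wikidata API with wbgetentities. Mapping specific Wikidata properties to the Mycomorphbox fields is something you would still have to do yourself, and not every template field is necessarily modelled on Wikidata.

import requests

# 1) Resolve the Wikipedia article to its Wikidata item (pageprops.wikibase_item).
r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "titles": "Amanita phalloides",
    "prop": "pageprops",
    "ppprop": "wikibase_item",
    "format": "json",
})
page = next(iter(r.json()["query"]["pages"].values()))
qid = page["pageprops"]["wikibase_item"]

# 2) Fetch the structured claims for that item from the Wikidata API.
r = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities",
    "ids": qid,
    "props": "claims",
    "format": "json",
})
claims = r.json()["entities"][qid]["claims"]
print(sorted(claims.keys()))  # property IDs (P...) that you can map to the traits you need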
I am trying to search by section with the Wikipedia API.
What I already know:
For the below:
https://en.wikipedia.org/w/api.php?format=xml&action=query&prop=revisions&titles=Game_of_Thrones_(season_1)&rvprop=content&rvsection=0
I know that rvsection=0 will give me section 0 of the Wikipedia page, and I can change this to get other sections of the page, e.g. 1, 2, 3.
What I am wondering is how/if I can search by section name. E.g., the page linked above has a section named "Episodes"; how can I search for this so that I get all the content from that section?
If this is not possible, is there a workaround? What I want to do is get episode information from different Wikipedia pages.
I have done some more research into this and have found a workable solution.
If we want to get a certain section then we need to query this information with the API below:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=(NAME_OF_WIKI_PAGE)&prop=sections&disabletoc=1
This will give us JSON info with the name and index of each section.
Once we have section info, use the parse API to get the wikitext. If we want the HTML, we can change prop to text:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=(NAME_OF_WIKI_PAGE)&prop=wikitext&section=(SECTION #)&disabletoc=1
As a result, we get the specific section we want, formatted as JSON. The next step for me is parsing this and converting the HTML/wikitext into plain text.
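Putting the two calls together, here is a minimal sketch in Python (assuming the requests library; the page title and the section name "Episodes" are just the example from the question): it looks up the section index by name, then fetches only that section.

import requests

API = "https://en.wikipedia.org/w/api.php"
page = "Game_of_Thrones_(season_1)"

# 1) List the sections and find the index of the one named "Episodes".
sections = requests.get(API, params={
    "action": "parse", "page": page, "prop": "sections",
    "disabletoc": 1, "format": "json",
}).json()["parse"]["sections"]
index = next(s["index"] for s in sections if s["line"] == "Episodes")

# 2) Fetch just that section (wikitext here; use prop=text for rendered HTML).
wikitext = requests.get(API, params={
    "action": "parse", "page": page, "prop": "wikitext",
    "section": index, "disabletoc": 1, "format": "json",
}).json()["parse"]["wikitext"]["*"]
print(wikitext[:500])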
I'm looking at the API documentation here,
https://www.mediawiki.org/wiki/API:Query
Getting the wikitext for a page is mentioned in the beginning of the documentation,
The action=query module allows you to get information about a wiki and the data stored in it, such as the wikitext of a particular page, the links and categories of a set of pages, or the token you need to change wiki content.
but I can't seem to figure out which parameters to pass in the API request to return the wikitext for a given page. Does anyone know how to do this?
I've tried parameters like,
{'action':'query', 'titles':'Anarchism', 'prop':'wikitext', 'format':'json'}
You must use this query:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Anarchism&rvslots=main
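For example, in Python with requests (an assumption; any HTTP client will do), you can read the wikitext out of the response like this. formatversion=2 is added here only because it makes the JSON easier to walk; it is not required by the query above.

import requests

r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "titles": "Anarchism",
    "prop": "revisions",
    "rvprop": "content",
    "rvslots": "main",
    "formatversion": 2,   # turns the "pages" object into a plain list
    "format": "json",
})
page = r.json()["query"]["pages"][0]
wikitext = page["revisions"][0]["slots"]["main"]["content"]
print(wikitext[:300])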
As you surely know, web.archive.org lets you inspect the history of a domain, e.g. http://web.archive.org/web/*/besttatoo.com
It also has an API: http://archive.org/help/json.php
I need to get data from the API, but I can't find much information on how to use it. Has anyone used it and can paste some examples of use?
This link provides details about the item LovingU on archive.org:
http://archive.org/details/LovingU&output=json
To create an API query to your liking, use this page:
https://archive.org/advancedsearch.php#raw
That page allows you to choose your output format (JSON, XML, HTML, CSV or RSS) and the parameters you want returned. You can limit the number of results, too.
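As a hedged sketch in Python with requests, a search built through that form can be replayed like this. The query string and field list below are made-up examples; the parameter names (q, fl[], rows, page, output) are the ones the advanced-search form generates, and the JSON mirrors a Solr response, so the hits typically sit under response.docs.

import requests

r = requests.get("https://archive.org/advancedsearch.php", params={
    "q": "LovingU",                       # example query; replace with your own
    "fl[]": ["identifier", "title", "date"],
    "rows": 5,
    "page": 1,
    "output": "json",
})
for doc in r.json()["response"]["docs"]:
    print(doc.get("identifier"), "-", doc.get("title"))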
I would like to create a bot. Someone would type "!123", the bot would search the repository for the value "123", and it would return (paste) the information found for that value. I'd like this to be universal, meaning it can be used anywhere, so maybe some sort of Firefox plugin.
Can someone provide me with information on where I can start?
I have an understanding of programming in C# and Java.
P.S. There is no intention for this to be some sort of spam bot; I just want to have a collection of information that people can easily reference.
There are multiple parts to your project:
A bot that crawls data from the web and saves it in a database (given that you are considering building your repository from the web). Google "web crawler/scraper" for that.
A data extractor/cleanser that cleans the data and extracts the relevant information from each document (this is important so that you can tag documents with the information needed for retrieval).
Then there is the search-engine part, which lets you retrieve relevant data from the repository; try a vector-similarity algorithm for that (see the sketch below).
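To illustrate the vector-similarity idea, here is a minimal bag-of-words cosine-similarity sketch in Python; the repository contents and keys are invented purely for illustration, and a real system would use TF-IDF weighting or embeddings instead of raw word counts.

from collections import Counter
import math

def cosine_similarity(a: str, b: str) -> float:
    # Bag-of-words cosine similarity between two pieces of text.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in set(va) & set(vb))
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Toy "repository" of documents the bot could answer from (contents are made up).
repository = {
    "123": "information about topic 123 stored by the bot",
    "456": "a different entry about something else entirely",
}

query = "topic 123 information"
ranked = sorted(repository.items(), key=lambda kv: cosine_similarity(query, kv[1]), reverse=True)
print(ranked[0])  # the best-matching (key, document) pair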
I'd like a way to download the content of every page in the history of a popular article on Wikipedia. In other words I want to get the full contents of every edit for a single article. How would I go about doing this?
Is there a simple way to do this using the Wikipedia API? I looked and didn't find anything that popped out as a simple solution. I've also looked into the scripts on the PyWikipedia Bot page (http://botwiki.sno.cc/w/index.php?title=Template:Script&oldid=3813) and didn't find anything that was useful. Some simple way to do it in Python or Java would be best, but I'm open to any simple solution that will get me the data.
There are multiple options for this. You can use the Special:Export special page to fetch an XML stream of the page history. Or you can use the API, found under /w/api.php. Use action=query&titles=$TITLE&prop=revisions&rvprop=timestamp|user|content etc. to fetch the history.
Pywikipedia provides an interface to this, but I do not know by heart how to call it. An alternative Python library, mwclient, also provides this via site.pages[page_title].revisions().
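If you prefer to stay with plain HTTP, here is a hedged sketch in Python using requests (the title is an arbitrary example); it pages through the full revision history by following the continue parameters the API returns. Note that the API lowers the per-request cap when revision content is requested, hence the modest rvlimit.

import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query",
    "titles": "Amanita phalloides",        # arbitrary example article
    "prop": "revisions",
    "rvprop": "ids|timestamp|user|content",
    "rvslots": "main",
    "rvlimit": 50,                          # page size; the API caps this lower when content is included
    "formatversion": 2,
    "format": "json",
}

revisions = []
while True:
    data = requests.get(API, params=params).json()
    page = data["query"]["pages"][0]
    revisions.extend(page.get("revisions", []))
    if "continue" not in data:              # no more pages of history
        break
    params.update(data["continue"])         # carries rvcontinue forward to the next request

print(len(revisions), "revisions fetched")
print(revisions[0]["timestamp"], revisions[0]["user"])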
Well, one solution is to parse the Wikipedia XML dump.
Just thought I'd put that out there.
If you're only getting one page, that's overkill. But if you don't need the very very latest information, using the XML would have the advantage of being a one-time download instead of repeated network hits.