How can I get the full change history for an article on Wikipedia?

I'd like a way to download the content of every page in the history of a popular article on Wikipedia. In other words, I want the full contents of every edit for a single article. How would I go about doing this?
Is there a simple way to do this using the Wikipedia API? I looked and didn't find anything that popped out as a simple solution. I've also looked into the scripts on the PyWikipedia Bot page (http://botwiki.sno.cc/w/index.php?title=Template:Script&oldid=3813) and didn't find anything useful. Something simple in Python or Java would be best, but I'm open to any simple solution that will get me the data.

There are multiple options for this. You can use the Special:Export special page to fetch an XML stream of the page history. Or you can use the API, found under /w/api.php. Use action=query&titles=$TITLE&prop=revisions&rvprop=timestamp|user|content etc. to fetch the history.
Pywikipedia provides an interface to this, but I do not know by heart how to call it. An alternative Python library, mwclient, also provides this via site.pages[page_title].revisions().
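For example, here is a minimal, untested sketch in Python using the requests library against the raw API; the article title is just a placeholder, and paging through the full history uses the API's standard continuation mechanism:

import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title):
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "revisions",
        "rvprop": "timestamp|user|content",
        "rvlimit": "max",
    }
    while True:
        data = requests.get(API, params=params).json()
        for page in data["query"]["pages"].values():
            for rev in page.get("revisions", []):
                yield rev
        if "continue" not in data:
            break
        params.update(data["continue"])  # follow the API's continuation token

for rev in fetch_revisions("Coffee"):  # placeholder article title
    print(rev["timestamp"], rev["user"])

Note that asking for content in rvprop means each revision's full wikitext comes back too, so expect large responses for a popular article.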

Well, one solution is to parse the Wikipedia XML dump.
Just thought I'd put that out there.
If you're only getting one page, that's overkill. But if you don't need the very latest information, using the dump has the advantage of being a one-time download instead of repeated network hits.
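If you do go the dump route, the file can be streamed with Python's standard library instead of being loaded whole. A rough sketch, where the filename and XML namespace are assumptions that vary by dump:

import xml.etree.ElementTree as ET

# Namespace varies with the dump's schema version; check the file header.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Filename is a placeholder for whichever dump you downloaded.
for event, elem in ET.iterparse("enwiki-pages-meta-history.xml"):
    if elem.tag == NS + "revision":
        timestamp = elem.findtext(NS + "timestamp")
        text = elem.findtext(NS + "text")
        # ... process one revision here ...
        elem.clear()  # free the revision's text so memory stays bounded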

Related

Extracting information from a specific template with the Wikimedia API

I'm wondering what the easiest way would be to extract only the information contained within a certain template using the Wikimedia API.
I'd like to extract the information contained in the template "Template:Mycomorphbox" for this page: http://en.wikipedia.org/wiki/Amanita_phalloides
I'm a bit frustrated that it seems like I have to pull the entire content of the page to get the information that I need. Surely there has to be a better way.
Indeed there is a better way: don't extract information from templates (or from wikitext in general) yourself. That's not your job nor your application's; it's MediaWiki's.
Use Wikidata, which is where the structured information from and for Wikipedia is stored. See the Wikibase API documentation, look at some of the properties used for biology, or ask if something is unclear.
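As a rough illustration in Python (using the requests library; wbgetentities is the relevant Wikibase API action):

import requests

resp = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities",
    "sites": "enwiki",
    "titles": "Amanita phalloides",
    "format": "json",
}).json()

# The response is keyed by the item's Q-id.
entity = next(iter(resp["entities"].values()))
print(entity["labels"]["en"]["value"])   # English label of the item
print(sorted(entity["claims"].keys()))   # property IDs available on the item

From there you can read the individual property values you need instead of parsing the Mycomorphbox template out of wikitext.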

Does Google index script tags as content when using handlebars.js?

If you use the standard handlebars.js implementation, does Google view the content within the custom script tags as content, script, or unknown content?
If you're in doubt, build it in pure HTML. Unfortunately, Google will probably ignore this content. From what I've read, this library was not made to be search-friendly.
Google can in fact understand and even follow links created via JavaScript, but handlebars.js templates are much more complex than that.
Possible solution
My strong suggestion is to load a simplified version with some of the content in plain HTML first, and only apply handlebars.js afterwards, so that Google is at least not left completely blind. But this simplified version should also be served to end users, because Google will know if you show different content just for Googlebot.
Possible solution 2
There is a way to make websites that rely heavily on AJAX still work for crawlers; see Google's guide, Making AJAX Applications Crawlable.

Umbraco: A/B testing, links in structure

I'm having a problem when trying to A/B test certain nodes in my node-tree in Umbraco.
What I want to do is copy a node in the node-tree to a specific spot and use that B-structure to see which of the structures works best, using Google Analytics.
For example we have two node structures, let's call them "Private" and "Sweden".
Their structure, with child nodes and properties, is exactly the same. The only difference between them is the property values (content). The "Private" URL is www.mysite.com/Private and the "Sweden" URL is www.mysite.com/Sweden.
What I would like to do is change every link in the B-structure so that it points to its match in the A-structure. The problem is that since these are two different structures, each page will have two different alternative links.
In other words, it should be a coincidence that a visitor enters the B-structure; on the next click they should be moved back to the A-structure.
We manage which page loads (either the A-node or the B-node) with scripts, so that each node has a 50% chance, and if a visitor lands on the B-node, Google Analytics saves the data. What we can't manage is making every link on that page point to the A-node.
I'd appreciate any help I can get.
Regards,
David
There are a couple of ways that seem likely to at least give you a start.
The /config/urlrewriting.config file allows you to set up multiple redirect rules within Umbraco, so a rule like the following might work for sending all requests (whether /sweden/pagename/ or /private/pagename/) back to the Private structure. Not sure how GA will handle it (adjust the rule name and virtualUrl pattern to suit your setup):

<add name="SwedenToPrivate"
     virtualUrl="^~/sweden/(.*)"
     rewriteUrlParameter="ExcludeFromClientQueryString"
     destinationUrl="http://www.mysite.com/private/$1"
     redirect="Domain"
     redirectMode="Permanent"
     ignoreCase="true" />
Secondly, a simple HttpModule (http://support.microsoft.com/kb/307996) can process all page requests and redirect as required; you could do a _gaq.push here directly or indirectly.
I'd be interested to know how you get on - it seems a good area for extension to Umbraco.
I'm not sure I have understood perfectly what you need to do, so please excuse any assumptions that may prove mistaken. Here's what I think:
Since the A and B nodes share the same HTML content (besides the links, of course), why don't you make the link href attribute dynamic by using a bit of Razor in the template or macro:
@{ var isANode = CurrentPage.Parent.Name == "Sweden"; }
A similar approach would work if you are using web forms.
We finally decided to go with the alternative-template solution. Since there seems to be no generic solution for this kind of problem, we had to create an alternative template with specific macros to render the different information for every document type we're using.
Creating dynamic links for every page is a hell of a job at this stage of the project, since there are so many pages and links. Also, some links are created in JavaScript, which is another problem.
I copied the A-structure to another node, purely to be able to change property values. There might be a problem logging and tracking the information with Google Analytics, though, so that's the next step for us in this project. In our alternative templates we're getting the property values from the B-structure.
Still, if anyone has a better solution, I'd highly appreciate it!
Regards,
David

Normal Google Custom Search

I'm writing an application that analyses search engine results.
With the Google Search API now deprecated and limited to 1000 queries/day, Google is forcing developers to move to the AJAX APIs and to use the Custom Search API to do a Google search.
The thing is, I don't need a custom search; I need a general search, not one that is filtered by site. OK, maybe filtered by USA/UK (Google.com/Google.co.uk).
Does anyone know how to just do a regular Google search using the AJAX APIs? Is the Custom Search the right thing to be using?
I don't want to hit the 1000/day limit using the old service, but it is exactly what I need.
I did find: How do I create a CSE that searches the entire web?
http://www.google.com/support/customsearch/bin/answer.py?hl=en&answer=1210656
But by the sounds of it this will distort the search results.
Thank you.
OK, here's how I think it is done:
Create a Custom Search Engine.
Add a site such as *.com. When this is created, go to the Advanced tab and download the context XML.
Remove the Background Label associated with the site.
Upload the XML to replace the previous context.
This seems to work just fine and is returning the same values as far as I can see.
Yes, you are right in theory*, and this should let you get 100 results a day on the fly. Just this Saturday though, Google confirmed how here -
(* so far though, we can't get it working...)

Automate adding entries to a wiki

Once I have my renamed files I need to add them to my project's wiki page. This is a fairly repetitive manual task, so I guess I could script it but I don't know where to start.
The process is:
Go to the appropriate page on the wiki
for each team member (DeveloperA, DeveloperB, DeveloperC)
{
    for each of two files ('*_current.jpg', '*_lastweek.jpg')
    {
        Select 'Attach' link on page
        Select the 'manage' link next to the file to be updated
        Click 'Browse' button
        Browse to the relevant file (which has the same name as the previous version)
        Click 'Upload file' button
    }
}
Not necessarily looking for the full solution as I'd like to give it a go myself.
Where to begin? What language could I use to do this and how difficult would it be?
Check if the wiki you mean to talk to supports XML-RPC, because if it does, it should be a snap. I wrote a tool called WikiUp to solve a similar problem (updating a delineated section on a wiki page).
If you're writing in C#, the WebClient classes might be a good place to start. I bet people could give more specific advice if you mentioned which wiki platform you are using, and whether it requires authentication, though.
I'd probably start by downloading Fiddler and watching the HTTP requests while doing it manually. Then you could use some simple scripts and regexes to build your HTTP requests for automating the process.
Of course, if you're wildly lucky, your wiki will have a backend simple enough that you could just plug them into its DB directly. :)
You might find CoScripter useful -- it's a Firefox extension that allows you to automate tasks you perform on websites. I'm not certain how you'd integrate this with the list of files you're changing on your local system, but it can certainly handle the file uploading through a web form.
A better bet is probably using cURL or a similar HTTP library with your programming language of choice. If you're on *nix, you can use the cURL command-line program inside your shell script to get this done fairly easily. (As #jsight said, you will need to analyze the actual forms you're using on the web page, using Fiddler or just by looking at the form elements, and re-create the POST through cURL.)
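To make that concrete, here is an illustrative Python sketch using the requests library; the URL and form field names below are hypothetical and must be replaced with whatever Fiddler shows your wiki's attach form actually submitting:

import requests

session = requests.Session()
# session.post("http://wiki.example.com/login", data={...})  # if the wiki needs auth

for dev in ("DeveloperA", "DeveloperB", "DeveloperC"):
    for suffix in ("current", "lastweek"):
        filename = "%s_%s.jpg" % (dev, suffix)
        with open(filename, "rb") as f:
            session.post(
                "http://wiki.example.com/attach",  # hypothetical upload URL
                data={"page": "TeamStatus"},       # hypothetical form fields
                files={"file": (filename, f, "image/jpeg")},
            )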