Wikipedia data extraction - sql

I am trying to populate some tables with Hindi Wikipedia data: article titles, their categories, and the corresponding English URLs.
Right now I find the category and English URL by parsing each HTML file and locating the relevant div tag, which takes a lot of time. Is there a more direct and efficient way to populate the categories?
I downloaded the Hindi Wikipedia dump from: ftp://wikipedia.c3sl.ufpr.br/wikipedia/hiwiki/20131201/

You could use a dump-parsing tool such as Wikiprep: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/
Or you could import the dump into the MediaWiki engine and let it handle the wiki markup:
http://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
Other options may be relevant to your case; see also:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Help_importing_dumps_into_MySQL
(I've personally used options #1 and #2)
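If a full import is more than you need, titles and category links can also be read straight from the pages-articles XML file. The following is a minimal sketch, not a replacement for the tools above. It assumes the dump uses the standard MediaWiki export layout (page, title and text elements), that categories appear in the wikitext as [[श्रेणी:...]] or [[Category:...]] links, and that English interlanguage links, where they are still stored in the wikitext at all, look like [[en:...]]. The names iter_pages, CATEGORY_RE and ENGLISH_LINK_RE are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed link syntax: [[श्रेणी:Name]] or [[Category:Name|sort key]]
CATEGORY_RE = re.compile(r"\[\[\s*(?:Category|श्रेणी)\s*:\s*([^\]|]+)", re.IGNORECASE)
# Assumed interlanguage syntax: [[en:English title]]
ENGLISH_LINK_RE = re.compile(r"\[\[\s*en\s*:\s*([^\]|]+)\]\]")


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_stream):
    """Yield (title, categories, english_title) for each <page> in the stream.

    The file is parsed incrementally, so the whole dump is never held in memory.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            categories = [c.strip() for c in CATEGORY_RE.findall(text)]
            match = ENGLISH_LINK_RE.search(text)
            english_title = match.group(1).strip() if match else None
            yield title, categories, english_title
            title, text = None, ""
            elem.clear()  # free the memory used by the finished page


def english_url(english_title):
    """Build an English Wikipedia URL from a title (spaces become underscores)."""
    return "https://en.wikipedia.org/wiki/" + english_title.replace(" ", "_")
```

For a real dump, pass an opened file object (for example one returned by bz2.open(path, "rb"), where path is wherever you saved the archive) to iter_pages and insert each yielded row into your tables.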

Related

MediaWiki api for Wikipedia - is it possible to search by title on ALL languages?

I know that to look up the page ID of a Wikipedia article with a known title, I can do:
https://en.wikipedia.org/w/api.php?action=query&titles=7_Studios
However, in this case, 7_Studios is a French Wikipedia article, so the above link does not work. Instead I need to try:
https://fr.wikipedia.org/w/api.php?action=query&titles=7_Studios
My question is: if I know only the title and not which language edition the article is in, how can I make sure I can find it using the API?
As Bergi mentioned, you can use Wikidata for this: it contains the database of interwiki links. Some article titles may be missing from it, but most should be there.
To do this, use the wbgetentities module: you specify the title to search for and a list of wikis to search. For example:
https://www.wikidata.org/w/api.php?action=wbgetentities&titles=7_Studios&sites=enwiki|frwiki|nlwiki|dewiki
You can specify up to 50 wikis in one query. Currently, there are around 300 Wikipedias, so if you really need to query all of them, you may need up to 6 requests for each title.
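The batching described above can be sketched as follows. This only builds the request URLs; it does not send them. The function name build_wbgetentities_urls is made up for this example, and the list of site IDs is assumed to come from wherever you keep your list of Wikipedias.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"
MAX_SITES_PER_QUERY = 50  # limit stated in the answer above


def build_wbgetentities_urls(title, sites, batch_size=MAX_SITES_PER_QUERY):
    """Return one wbgetentities URL per batch of at most batch_size site IDs."""
    urls = []
    for start in range(0, len(sites), batch_size):
        batch = sites[start:start + batch_size]
        params = {
            "action": "wbgetentities",
            "sites": "|".join(batch),  # e.g. "enwiki|frwiki|nlwiki"
            "titles": title,
            "props": "sitelinks",      # only the interwiki links are needed here
            "format": "json",
        }
        urls.append(WIKIDATA_API + "?" + urlencode(params))
    return urls
```

Each URL can then be fetched in turn; you can stop at the first response that contains an entity with sitelinks.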

How to get information in info box of Wikipedia articles using Wikipedia api?

I'm trying to get the lead actor's name from a movie's Wikipedia article.
I tried different values for prop; prop=info seems the most relevant, but it doesn't contain the information in the article's infobox.
See:
http://en.wikipedia.org/w/api.php?action=query&prop=info&titles=Casino_Royale_(2006_film)&format=jsonfm
Is it possible to extract the infobox information using the Wikipedia API?
The MediaWiki API doesn't understand infoboxes. So, you have basically two options:
Parse the infobox yourself. You can either parse the wikitext directly or the generated HTML table (both are available from the API).
Let somebody else do the parsing. This is exactly what DBpedia does. Wikidata tries to do something similar, but it probably won't contain enough data to be usable for a long time; see growth statistics.
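For the first option, parsing the wikitext yourself, here is a rough sketch. It assumes you have already fetched the page's wikitext (for example with prop=revisions and rvprop=content) and that the infobox is a template whose name starts with "Infobox". It is a naive brace-matching parser, not a full wikitext parser: triple-brace parameters, comments and nowiki sections are not handled. All function names are made up for this example.

```python
import re


def find_infobox(wikitext):
    """Return the text inside the first {{Infobox ...}} template, or None."""
    start = re.search(r"\{\{\s*Infobox", wikitext, re.IGNORECASE)
    if start is None:
        return None
    depth, i = 0, start.start()
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:  # matching close of the infobox itself
                return wikitext[start.start() + 2:i - 2]
        else:
            i += 1
    return None  # template was never closed


def split_top_level(body):
    """Split on '|' characters that are not nested inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == "|" and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext):
    """Return a dict mapping infobox field names to their raw wikitext values."""
    body = find_infobox(wikitext)
    if body is None:
        return {}
    fields = {}
    for part in split_top_level(body)[1:]:  # part 0 is the template name
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def link_targets(value):
    """Return the targets of all [[...]] links in a field value."""
    return [t.strip() for t in re.findall(r"\[\[([^\]|]+)", value)]
```

With this, link_targets(parse_infobox(wikitext).get("starring", "")) gives the linked names in the starring field, assuming the infobox uses a field with that name; the first entry is usually, but not necessarily, the lead actor.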

Tagging documents with predefined labels

I am working with a large number of documents and have a set of predefined categories/tags (which may be phrases) that appear in the text of the documents in either exact or inexact form.
I want to assign each document to exactly one tag: the one closest to its text.
Please give me some direction on how to approach this problem.
You can look at the Lucene search engine, which can tag documents while indexing. The Northern Light search engine used to do something similar to what you describe in its search methodology; you can look at its implementation to get an idea.
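For a small-scale illustration of "closest tag, exact or inexact", here is a sketch that does not use Lucene at all. It swaps in plain fuzzy string matching from Python's difflib: each tag is compared against every window of the same number of words in the document, and the tag with the best-matching window wins. The function names are made up, and this brute-force scan is only an assumption about what "closest" could mean; it will not scale to large collections the way an index would.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tag_score(tag, doc_tokens):
    """Best fuzzy-match ratio (0..1) between the tag and any same-length word window."""
    tag_tokens = tokenize(tag)
    n = len(tag_tokens)
    if n == 0 or len(doc_tokens) < n:
        return 0.0
    target = " ".join(tag_tokens)
    best = 0.0
    for i in range(len(doc_tokens) - n + 1):
        window = " ".join(doc_tokens[i:i + n])
        best = max(best, SequenceMatcher(None, target, window).ratio())
    return best


def best_tag(document, tags):
    """Return the single tag whose best window match in the document is highest."""
    doc_tokens = tokenize(document)
    return max(tags, key=lambda tag: tag_score(tag, doc_tokens))
```

An exact occurrence of a tag scores 1.0, so exact matches always beat inexact ones; ties are resolved in favour of the tag listed first.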

Hiding or Promoting specific content within a page to search engines

A bit of an SEO question here.
I've got a site with a ton of pages of content. I know lots of the content is the same on each page.
I thought that search engines keyed off the differences in page content so that they could promote the correct data, but when I look at the summary in Google and Bing, it shows my 'feedback' block (which is where I just ask for feedback).
Yahoo (and the summary in Facebook) shows my search options menu.
These aren't really things that are going to make a person want to click on the page.
So I'm wondering what the best way is to either hide this content from search engines, or improve the visibility of the other content that should get indexed.
The page structure is pretty consistent, so I thought it would have been easy for the search robots to pick this stuff out, but apparently not.
You may want to try using a meta tag like this.
<meta name="description" content="Here is a short summary of the page">
Search engines also prefer title and header tags over regular text.
The meta tag is the best way to do that.
However, be aware that the structure of your page is also important: search engines prefer to use the meta tag, but they also weigh the page structure, keywords, headers, and similar signals.
I ran into the same trouble a couple of months ago: Google showed the price and download text rather than my meta description. I solved it by rewriting the meta description (more accurate and shorter, 177 characters), removing the tags around the price and download elements, and making some slight adjustments to the structure. Now the Google summary is what I want.
Hope this helps you!
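If the pages are generated from templates, the description tag can be produced per page from a short summary. A minimal sketch, assuming each page has some plain-text summary available; the 160-character default is an arbitrary illustrative limit, not a documented search-engine rule, and meta_description is a made-up helper name.

```python
from html import escape


def meta_description(summary, max_chars=160):
    """Build a description meta tag from plain text, trimmed at a word boundary."""
    text = " ".join(summary.split())  # collapse runs of whitespace and newlines
    if len(text) > max_chars:
        # Cut at the last space before the limit so no word is split in half.
        text = text[:max_chars].rsplit(" ", 1)[0].rstrip(",;:") + "..."
    return '<meta name="description" content="{}">'.format(escape(text, quote=True))
```

Escaping matters here because a quote character in the summary would otherwise end the content attribute early.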

Tool or methods for automatically creating contextual links within a large corpus of content?

Here's the basic scenario - I have a corpus of say 100,000 newspaper-like articles. Minimally they will all have a well-defined title, and some amount of body content.
What I want to do is find runs of text in articles that ought to link to other articles.
So, if article Foo has a run of text like "Students in 8th grade are being encouraged to read works by Jean-Paul Sartre" and article Bar is titled (and about) "The important works of Jean-Paul Sartre", I'd like to automagically create that HTML link from Foo to Bar within the text of Foo.
You should ask yourself something before adding the links: what benefit for users do you want to achieve by doing this? You probably want to increase the navigability of your site. Maybe it is better to create an easier way to add links to older articles in the form used to submit new ones. Maybe it is possible to add a "one click search for selected text" feature. Maybe you can add wiki-like functionality that lets users propose links for selected text. You probably also want to add links to related articles (generated through a tagging system or text mining) below the articles.
Some potential problems with a fully automated link adder:
You may need to implement a good word sense disambiguation algorithm; placing automatic links with a regex (or simple substring matching) produces bad links that confuse or even irritate the user.
As the number of articles is large, you do not want to generate the HTML for extra links on every request; cache it instead.
You need to decide how to handle duplicate titles and titles that contain another title as a substring (take the longest title, link to the most recent article, or prefer an article from the same category).
TLDR version: find alternative solutions that provide desired functionality to the users.
What you are looking for are text mining tools. You can find more info and links at http://en.wikipedia.org/wiki/Text_mining. You might also want to check out Lucene and its ports at http://lucene.apache.org. Using these tools, the basic idea would be to find a set of similar articles based on the article (or title) in question. You could search various properties of the article, including the title, the content, or both. A tagging system a la Delicious (or Stack Overflow) might also be helpful. Rather than pre-creating the links between articles, you'd present the relevant articles in an interface much like the Related questions interface on the right-hand side of this page.
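To make the "find similar articles" idea concrete, here is a small self-contained sketch using TF-IDF weights and cosine similarity. This is a generic text-mining technique written with the standard library only; it is not Lucene's scoring formula, and the function names are made up. For 100,000 articles you would want a real index rather than this all-pairs comparison.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, using smoothed IDF."""
    counts_per_doc = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for counts in counts_per_doc for term in counts)
    vectors = []
    for counts in counts_per_doc:
        vectors.append({
            term: tf * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, tf in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def related(index, docs, top_n=3):
    """Return indices of the top_n documents most similar to docs[index]."""
    vectors = tfidf_vectors(docs)
    scores = [(cosine(vectors[index], v), i) for i, v in enumerate(vectors) if i != index]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _score, i in scores[:top_n]]
```

The returned indices can feed a "related articles" box below each article; the result should be cached rather than recomputed per request.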
If you wanted to find and link specific text in each article, I think you'd need to do some preprocessing to select pertinent phrases to key on. Even then I think it would be very hard not to miss things due to punctuation/misspellings or to not include irrelevant links for the same reasons.
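As a rough sketch of that last approach, assume the preprocessing step has already produced a mapping from key phrases to article URLs (the phrases and URLs here are hypothetical). The function below links the first occurrence of each phrase in plain text, preferring the longest phrase when several could match at the same position. It does exact, case-insensitive matching only, so the misspelling and disambiguation problems mentioned above remain.

```python
import re
from html import escape


def add_links(text, phrase_to_url):
    """Return HTML for plain text with the first occurrence of each phrase linked."""
    phrases = sorted(phrase_to_url, key=len, reverse=True)  # longest phrase first
    if not phrases:
        return escape(text)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )
    lookup = {phrase.lower(): url for phrase, url in phrase_to_url.items()}
    already_linked = set()
    out, pos = [], 0
    for match in pattern.finditer(text):
        key = match.group(0).lower()
        url = lookup.get(key)
        if url is None or key in already_linked:
            continue  # leave repeated mentions as plain text
        already_linked.add(key)
        out.append(escape(text[pos:match.start()]))
        out.append('<a href="{}">{}</a>'.format(escape(url, quote=True), escape(match.group(0))))
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)
```

Because matches never overlap, a phrase that is contained in a longer linked phrase is not linked inside it, which is one way to settle the substring question raised above.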