Functions on Wikipedia dump file - wikipedia-api

We can use the functions from the Wikipedia API to get some results from Wikipedia.
For example:
import wikipedia
print(wikipedia.search("Bill", results=2))
My question is: how can I use the Wikipedia API functions on a specific version of Wikipedia (e.g. just the 2017 Wikipedia)?

I doubt that this is possible. PyWikibot uses the online MediaWiki API (in this case for the Wikipedia site), which always serves the live data.
The dumps you mention are offline snapshots of Wikipedia's data (assuming you're talking about https://dumps.wikimedia.org/). This data is not connected to the MediaWiki API in any way and can therefore not be queried with it.
What you can do to work with Wikipedia's data as of a specific point in time:
If it's a limited number of pages only: you could write a script which goes through the available revisions of each page and selects the one closest to the time you want (see the sketch after this list). That's probably error-prone, a lot of work, and does not really scale.
Download the dump you want to query and write a script which works on those files (e.g. the database dump or the static HTML dump, depending on what you want to do; that's not really clear from your question).
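For the first option, here is a minimal sketch of picking the revision closest in time to a given cutoff via the MediaWiki API (assuming the requests package; the page title and cutoff date are just placeholders):

import requests

API = "https://en.wikipedia.org/w/api.php"

def revision_as_of(title, cutoff):
    # Newest revision at or before `cutoff` (ISO 8601), walking backwards in time.
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,
        "rvstart": cutoff,        # e.g. "2017-12-31T23:59:59Z"
        "rvdir": "older",
        "rvprop": "ids|timestamp",
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    page = next(iter(data["query"]["pages"].values()))
    rev = page["revisions"][0]
    return rev["revid"], rev["timestamp"]

print(revision_as_of("Bill Gates", "2017-12-31T23:59:59Z"))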

On a dump file of a specific version we cannot use the Wikipedia API. We can only read the dump file with our own code and extract whatever we need from it.
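A minimal sketch of reading a dump directly, streaming the XML so the whole file never has to fit in memory (the dump file name is a placeholder and the export namespace may differ between dump versions):

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-20170101-pages-articles.xml.bz2"          # placeholder file name
NS = "{http://www.mediawiki.org/xml/export-0.10/}"       # check the namespace used by your dump

with bz2.open(DUMP, "rb") as f:
    for _event, elem in ET.iterparse(f):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(NS + "revision/" + NS + "text") or ""
            # ... whatever analysis you need on title/text goes here ...
            elem.clear()                                  # free memory for processed pages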

Related

Picking the right database technique for file storage and search

For a personal project I am searching for the "most suitable" database engine to address the following key requirements:
need to store large amounts of individual document files (PDF)
need to perform full-text search over the PDFs (for this I plan to use OCR and save the processed data/metadata additionally in the database)
need to fetch pieces/chunks of the saved documents (for example from a specific year) and show a preview of many of them within a nice web UI
as much performance as possible
Up to now I have worked a lot with SQL (MySQL) and have some theoretical knowledge of other systems (Memcached, Redis, PostgreSQL, MongoDB). But I've never used them in combination and never got to the point of WHEN they should be used for WHAT exactly, or how they can be combined.
I think that especially for a project like this it's very important to select the right engine from the beginning, so as not to hit performance issues later.
So, especially to all experienced developers out there: what would be your favourite choice for this kind of project (I guess SQL may not be the only right solution)?
Or, in the end, would it be better to store the files in the filesystem and keep only the metadata in a database?
BTW, my planned API backend for this is Laravel 7+, the frontend will be Vue 2+.
Thank you very much!
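As a rough illustration of the filesystem-plus-metadata pattern asked about above: keep each PDF on disk and store only its path, metadata and OCR'd text in the database, with a full-text index over the text. A minimal sketch using Python and SQLite's FTS5 extension, purely for illustration; in a Laravel stack the same idea maps to a migration plus MySQL FULLTEXT or a dedicated search engine:

import sqlite3
from pathlib import Path

db = sqlite3.connect("documents.db")                     # requires an SQLite build with FTS5
db.execute("""CREATE TABLE IF NOT EXISTS documents (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,                                  -- the PDF itself stays on the filesystem
    year INTEGER,
    title TEXT)""")
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS doc_text USING fts5(body)")

def add_document(path, year, title, ocr_text):
    cur = db.execute("INSERT INTO documents (path, year, title) VALUES (?, ?, ?)",
                     (str(path), year, title))
    db.execute("INSERT INTO doc_text (rowid, body) VALUES (?, ?)", (cur.lastrowid, ocr_text))
    db.commit()

def search(term, year=None):
    sql = ("SELECT d.path, d.title FROM doc_text "
           "JOIN documents d ON d.id = doc_text.rowid WHERE doc_text MATCH ?")
    args = [term]
    if year is not None:
        sql += " AND d.year = ?"
        args.append(year)
    return db.execute(sql, args).fetchall()

add_document(Path("archive/report-2019.pdf"), 2019, "Annual report", "annual report text extracted by OCR")
print(search("report", year=2019))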

How do you implement search over static content within cshtml files

I am using asp.net core and Razor - and as it is a help system I would like to implement some kind of search facility to bring back a list of results hyperlinked based on the search terms.
I would like the search to essentially iterate over the content contained within the page's markup tags and then link this to the appropriate page/view.
What is the best way to do this?
I'm not even sure how you get a handle on the actual content of your own cshtml pages and then go from there.
This question is far too broad. However, I can provide you some pointers.
First, you need to determine what you're actually wanting to surface and where that data lives. Your question says "static web pages", but then you mention .cshtml. Traditionally, when it comes to creating your own search, you're going to have access to some particular dataset (tables in a database, for example). It's much simpler to search across the more structured data than the end result of it being dumped in various and sundry places over a web page.
Search engines like Google only index in this way because they typically don't have access to the raw data (although some amount of "access" can be granted via things like JSON-LD and other forms of Schema.org markup). In other words, they actually read from the web page out of necessity, because that's what they have to work with. It's certainly not the approach you would take if you have access to the data directly.
If for some reason you need to actually spider and index your own site's HTML content, then you'll essentially have to do what the big boys do: create a bot, run it on a schedule, crawl your site link by link, downloading each document, and then parse and process it. The end result would be a set of structured data that you can actually query against, which is why all this is pretty much wasted effort if you already have that data.
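A minimal sketch of such a crawl, assuming the requests and beautifulsoup4 packages and a hypothetical start URL; it just collects each page's visible text for later indexing:

from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

START = "https://example.com/help/"                      # hypothetical site to crawl
seen, queue, documents = set(), [START], {}

while queue:
    url = queue.pop()
    if url in seen:
        continue
    seen.add(url)
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    documents[url] = soup.get_text(" ", strip=True)      # page text to index later
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])
        if urlparse(link).netloc == urlparse(START).netloc:   # stay on our own site
            queue.append(link)

print("crawled", len(documents), "pages")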
Once you have the data, however you got there, you simply query it. In the most basic of forms, you could store it in a table in a database and literally issue SQL queries against it. Your search keywords/parameters are essentially the WHERE of the SELECT statement, so you'd have to figure out a way to map the keywords/parameters you're receiving to an acceptable WHERE clause that achieves that.
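In its simplest form, that mapping can be one LIKE condition per keyword; a sketch with Python's built-in sqlite3 module (the sample rows are made up):

import sqlite3

documents = {                                            # made-up crawled pages
    "/help/install": "how to install and license the product",
    "/help/faq": "frequently asked questions",
}

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE pages (url TEXT, body TEXT)")
db.executemany("INSERT INTO pages VALUES (?, ?)", documents.items())

def search(keywords):
    # every keyword must appear somewhere in the body
    where = " AND ".join("body LIKE ?" for _ in keywords)
    params = ["%" + kw + "%" for kw in keywords]
    return [url for (url,) in db.execute("SELECT url FROM pages WHERE " + where, params)]

print(search(["install", "license"]))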
More traditionally, you'd use an actual search engine: essentially a document database that is designed and optimized for search, and that generally provides a more search-appropriate API to query against. There are lots of options in this space, from roll-your-own to hosted SaaS solutions and anywhere in between. Of course the cost meter goes down the more work you do yourself and up the more it works out of the box.
One popular open-source and largely free option is Elasticsearch. It uses Lucene indexes, which it stitches together in a clustered environment to provide failover and scale. Deployment is a beast, to say the least, though it's gotten considerably better with things like containerization and orchestration. You can stand up an Elasticsearch cluster in something like Kubernetes with relative ease, though you will still probably need to do a bit of config. Elasticsearch also has hosted options, but you know, cost.
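For completeness, a sketch of indexing and querying a page with the official Python Elasticsearch client against a hypothetical local cluster (the call style below follows the 8.x client and differs slightly in older versions):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")              # assumes a local single-node cluster

# one document per crawled page
es.index(index="help-pages", id="install", document={
    "url": "https://example.com/help/install",
    "body": "how to install and license the product",
})

# full-text query over the body field
hits = es.search(index="help-pages", query={"match": {"body": "install"}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["url"], hit["_score"])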

Apache Mahout as Recommendation Engine

I want to use Apache Mahout as a recommendation engine, but over here I found that it forces us to use its own table called taste_preferences, with only 3-4 columns and the data type as a number (long/bigint). Is it mandatory to use this table and store data in numeric format only?
That is one way to build a recommendation engine, but there are simpler ways as well.
There is a small book available for free from
http://www.mapr.com/practical-machine-learning
which explains a way to deploy recommendation engines on top of a search engine. This requires an off-line analysis to build the data that gets put into the search engine, but once you have the indicator data in the search engine, you can do recommendations using search queries. These queries are not textual queries, but instead use past behavior as a query.
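A toy sketch of that idea (the dataset and cut-offs are made up): an offline pass turns user histories into per-item "indicator" lists, and a recommendation is then just a query built from a user's recent behaviour; in a real system the indicators would be indexed into the search engine and the behaviour list sent as the search query.

from collections import Counter, defaultdict
from itertools import combinations

# offline step: co-occurrence counts from user histories
histories = [["a", "b", "c"], ["a", "b"], ["b", "c", "d"]]
cooccur = defaultdict(Counter)
for items in histories:
    for x, y in combinations(set(items), 2):
        cooccur[x][y] += 1
        cooccur[y][x] += 1

# keep the strongest co-occurring items as "indicators" for each item
indicators = {item: [i for i, _ in c.most_common(2)] for item, c in cooccur.items()}

# online step: score candidates by how many of their indicators
# appear in the user's recent behaviour (the "query")
def recommend(recent, k=3):
    scores = Counter()
    for item, inds in indicators.items():
        if item not in recent:
            scores[item] = sum(ind in recent for ind in inds)
    return [item for item, score in scores.most_common(k) if score > 0]

print(recommend(["a"]))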
You can also see slides describing the approach here:
http://www.slideshare.net/tdunning/building-multimodal-recommendation-engines-using-search-engines
and here:
http://www.slideshare.net/tdunning/using-mahout-and-a-search-engine-for-recommendation
The book is easier to understand than the slides without the narrative, but both are likely useful since the slides have more details.

How to analyse Wikipedia's article database with R?

This is a "big" question, that I don't know how to start, so I hope some of you can give me a direction. And if this is not a "good" question, I will close the thread with an apology.
I wish to go through the database of Wikipedia (let's say the English one), and do statistics. For example, I am interested in how many active editors (which should be defined) Wikipedia had at each point of time (let's say in the last 2 years).
I don't know how to build such a database, how to access it, how to know which types of data it has and so on. So my questions are:
What tools do I need for this (besides basic R)? MySQL on my computer? An RODBC database connection?
How do you start planning for such a project?
You'll want to start here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Which will take you to here:
http://download.wikimedia.org/enwiki/20100312/
And the file you probably want is:
# 2010-03-17 04:33:50 done Log events to all pages.
* This contains the log of actions performed on pages.
* pages-logging.xml.gz 1.0 GB
http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
You'll then import the xml into MySQL. Generating a histogram of users per day, week, year, etc. won't require R. You'll be able to do that with a single MySQL query. Something like:
select DAYOFYEAR(wiki_edit_timestamp), count(*)
from page_logs
group by DAYOFYEAR(wiki_edit_timestamp)
order by DAYOFYEAR(wiki_edit_timestamp);
etc.
(I'm not sure what their actual schema is, but it'll be something like that.)
You'll run into issues, no doubt, but you'll learn a lot too. Good luck!
You could:
work with the Wikipedia database dumps, as already mentioned
work with the live MediaWiki API, see this minimal example at Rosettacode or my unfinished approach with an S3 class or this package by Peter Konings
work with DBpedia, an effort to extract knowledge from Wikipedia into a knowledge base. They offer online SPARQL access I don't know much about, and also datasets as N-Triples for download. See this Python script, which might be a starting point for an R script. This approach might be useful for accessing the content stored in Wikipedia (such as the infoboxes), but I am not sure whether information on contributors to Wikipedia is available.
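For the DBpedia route, a small sketch of querying the public SPARQL endpoint from Python with requests (the query itself is only an illustration, and DBpedia property names change over time):

import requests

ENDPOINT = "https://dbpedia.org/sparql"
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name ?birthDate WHERE {
  ?person a dbo:Scientist ;
          rdfs:label ?name ;
          dbo:birthDate ?birthDate .
  FILTER (lang(?name) = "en")
} LIMIT 5
"""

resp = requests.get(ENDPOINT, params={"query": QUERY,
                                      "format": "application/sparql-results+json"})
for row in resp.json()["results"]["bindings"]:
    print(row["name"]["value"], row["birthDate"]["value"])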
Try WikiXRay (Python/R) and zotero.

Wiki Database, is there one?

I was searching the net for something like a wiki database, just like Wikipedia but one that stores structured content, editable by users. What I was looking for was an online database accessible by everyone, where people can design the schema and the data with proper versioning of both. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is great potential for something like this. A possible example would be a website with a GUI for querying a MySQL DB where any website visitor can create DB objects and populate data.
UPDATE: I registered the domain wikidatabase.org to get started on a tool, but I haven't found enough time yet. If anyone is interested in spending some time coding on this, please let me know at wikidatabase.org.
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in/out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value pairs, inheritance, templates, etc.), but you can also interface with the API, write DekiScript, grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.
OpenRecord, development of which appears to have halted in 2008, comes close to this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed: the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to be implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in CouchDB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects; each project can have multiple directories, and each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.