Wiki Database, is there one? - sql

I was searching the net for something like a wiki database, just like wikipedia but instead stores structured content, editable by users. What I was looking for was an online database accessible by everyone where people can design the schema and data with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is a great potential for something like this. A possible example will be a website with a GUI for querying a MySQL DB where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org

It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.

OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.

This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.

Do you mean something like http://www.freebase.com?

You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.

Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, You can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, and which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.

mindtouch should be able to do it. It's rather easy to get data in / out. (for example: it's trivial to aggregate all the IP's for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/value..inheritance, templates, etc...) but you can also interface with the API, write dekiscript, grab the XML...

I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than wikipedia though.

OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes it is loosely typed - the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to have been implied.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.

Perhaps you might be interested in Couch DB:
Apache CouchDB is a document-oriented
database that can be queried and
indexed in a MapReduce fashion using
JavaScript. CouchDB also offers
incremental replication with
bi-directional conflict detection and
resolution.

I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects, each project can have multiple directories, each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/

I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.

Related

Developing a search and tag heavy website

I'm in the planning phase of developing a very tag heavy website. Everything will essentially be associated with tags and the entire site would be based on searching these tags.
Now, I've been thinking a lot about going the nosql route here, since from what I read and understand, it makes the most sense for something like this.
Would it be best to go with this database system? Would it makes sense to go with the relational database system? Should I think incorporating something like SOLR?
What would the ideal setup be?
UPDATE:
Ideally they would be user generated, but we all know how that would turn out with giving users that much power. So, let’s change up the requirements and say that users WILL NOT have the power to create tags.
Searching on tags based on text matches is something that would probably be useful and needed. If the tag is “garage sale”, the search for “sale” should also pick this up, at a lower relevance for sure.
I can’t imagine the usage being so much that scaling would be an issue.
Thanks
I would spend a bit of time thinking about these tags. For example, are these tags going to be user generated or will you provide a few tags and let users select which ones they want?
Will you need to search on tags based on text matches? For example if a tag is "garage sale" do you want to search for "sale" to also pick this up? Maybe at a lower relevance?
Also, what kind of usage are you looking at? One good thing about Solr is that it's super easy to scale and synchronize data, it is easy to deploy multiple nodes, shard collections and replicate data to other nodes, something that traditional databases struggle with.
Another thing to keep in mind is that most of the time, Solr is not the official "repository of record", most of the time the data gets fed to it from a DB somewhere, but all reading activities are done from Solr.
See this answer for a SQL solution. Offhand I can't think of any advantage to using most NoSQL databases (i.e. key-value, columnar, or document) as the SQL solution will be more compact and ought to give good performance; a graph database may be appropriate if you're doing a lot of navigational type queries on your tags, but it doesn't sound like that's the case.
Use of Solr (or ElasticSearch or whatever) is orthogonal to your primary database; it may be appropriate to incorporate a search tool if users are typing inexact tags for search, but I recommend integrating a stemming library or something along those lines before turning to a full blown search tool.

JSON vs classic schema design [duplicate]

The Project
I've been asked to work on an interesting project -- what amounts to a basic Web CMS -- that uses HTML/CSS/jQuery with PHP. However, one requirement is that there won't be a database to house the data (they want flat files for the documents/pages -- preferable in JSON format).
In a very basic sense, it'll be used to generate HTML pages via a very "non-techie" interface. Each installation would only have around 20 pages, but a few may get up to 100. It has to be fairly easy to drop onto a PHP capable server and run, with very little setup needed.
What's Out There
There are tons of CMS options and quite a few flat file versions. But an OSS or other existing CMS is not an option. They need a simple propriety system.
Initial Thoughts
So flat files it is... but I'd really like to get some feedback on the drawbacks, and if it is worth the effort to try and convince them to use something like MySQL (SQLite or CouchDB are out since none of the servers can be configured to run them at the present time).
Of course the document files are pretty straightforward, but we're also talking about login info for 1 or 2 admins per installation, a few lists, as well as configs/settings (which also can easily be stored in a file with protection).
The Dilemma
If there are benefits to using MySQL rather than JOSN formatted files and some arrays in a simple project like this -- beyond my own pre-conceived notions :) -- I'll be sure to argue them.
But honestly I can't see any that outweigh their need to not have a database system.
I'd appreciate you insight and opinions.
If you can't cite a specific need for relational table design, then you're good with flat files. Build as specified. The moment you can cite a specific need, let them know; upgrading isn't that hard, if you're perception is timely (that is, if you aren;t in the position of having to normalize data that should have been integrated earlier).
It's a shame you can't use CouchDB, this seems like the perfect application for it. Keep in mind that using flat-files severely constrains your architecture and, especially, scalability.
What's the best case scenario for your CMS app? It's successful and people want to use it more? If you're using flat-files it'll be harder to service and improve your system (e.g. make it more robust, and add new features for future versions) and performance will not scale well. So "success" in this case is at best short-lived, as success translates into more and more work for less and less gains in feature-set and performance.
Then again, if the CSM is designed right, then switching between a flat file to RDMS should be as simple as using a different data access file.
Will this be installed on any shared hosting sites. For this to work somewhat safely, a mechanism like suEXEC needs to be set up properly as the web server will need write permissions to various directories.
What would be cool with a simple site that was feed via JSON and jQuery is that the site wouldn't need to load on each click. Just the relevant data would change. You could then use hashes in the location bar to keep track of where you were (ex. http://localhost/#about)
The problem being if they are editing the raw JSON file they can mess it up pretty quick. I think your admin tools would have to generate the JSON files based on the input so that you can ensure nothing breaks. The admin tools would be more entailed then the site (though isn't that always the case with dynamic sites)
What is the predicted data sizes for the CMS?
A large reason for the use of a RDMS is quick,specific access to large amounts of data. The data format might not be large, but if there is a lot of the data, then it might be better in the long run for a RDMS.
Then again, if the CSM is designed right, then switching between a flat file to RDMS should be as simple as using a different data access file.
While an RDBMS may be necessary for a very large CMS, a small one could run off flat files very well. A lot of CMS products out there fall down in that regard, I think, by throwing an RDBMS into the mix when there's no real need.
However, if you are using flat files, there are security issues which others have highlighted. Another issue I've come across is hosting providers using the disable_functions directive in php.ini to disable file I/O functions like fopen() and friends. If you're hosting your CMS on a box you control, you won't have this problem but if you're using a third-party provider, check first.
As the original poster, I wasn't signed in, so I'm following up to the answers so far in an answer (sorry if this is bad form).
There may instances where this is on
a shared host.
Though the JSON files can technically
be edited, this won't be the case.
The admin interface will be robust
enough to do all of the creating/editing of pages
The size for each install will be
relatively small -- 1 - 2 admins,
10-100 pages. A few lists of common
items may run longer (snippets of
copy for example).
Security will be a big issue -- any
other options suggestions on this
specifically?
Well, isn't there a problem with they being distrustful to any database system? Isn't the problem more in their thinking than in technology? Maybe they are afraid of database because it sounds complex to them. In that case, if you just present them some very simple CMS (like CMS made simple, which I've heard is really simple and the learning process is very fast), if they see everything is easy then may be they just don't care what's behind, if it's a database or whatever!
They could hear to arguments like better maintenance, lower cost of maintenance, much better handover to another webmaster than proprietary solutions (they are not dependent on you) etc.

I need advise choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about 100s of Public Transit agencies, their routers, stations, times, and other related information. I will be getting my information from here and the google code wiki page with similar info. There is a lot of data and its partitioned into multiple CSV formatted text files. These can be huge, some ranging in 80-100mb of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their GeoSpatial support that can really optimize getting small datasets. But I also need to be sure to link all the stops on a route because I will be propagating information along a transit route for that line. In this case I have found that I can benefit from a Graph DB like Neo4j and OrientDB, but from what I know, neither has GeoSpatial support nor am I 100% sure that a Graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help on finding the best possible for my situation. I know I will possible have to work around limitations of whatever I choose, but I want to at least have done my research and know that its the best I can get at the moment.
I have also been suggested to splinter the data into multiple DBs, but that could get very messy because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits 100% your problem. My advice here is to go for some geo spatial module over neo4j or orientdb, althought you have some others free and open source implementation.
I think the best one right now, with all the geo spatial thing implemented is neo4j-spatial package. But as far as I know, you can also reproduce most of the geo spatial thing on your own if necessary.
BTW talking about splitting, if the amount of data/queries will be high, I strongly recommend you to share the load and think the model in this terms. Sure you can do something.
I've used Mongo's GeoSpatial features and can offer some guidance if you need help with a C# or javascript implementation - I would recommend it to start because it's super easy to use. I'm learning all about Neo4j right now and I am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database [in other words(in the context of a relational database) - querying data from the database that provides the data required to generate an entity or list of entities that fulfills the requested parameters] does not require significant processing (multiple joins, for instance)
Do you require ACID compliance(aside:if you have a graph problem, you can leverage Neo4j for this)? There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases have their performance problems when their database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/

How to analyse Wikipedia article's data base with R?

This is a "big" question, that I don't know how to start, so I hope some of you can give me a direction. And if this is not a "good" question, I will close the thread with an apology.
I wish to go through the database of Wikipedia (let's say the English one), and do statistics. For example, I am interested in how many active editors (which should be defined) Wikipedia had at each point of time (let's say in the last 2 years).
I don't know how to build such a database, how to access it, how to know which types of data it has and so on. So my questions are:
What tools do I need for this (besides basic R) ? MySQL on my computer? RODBC database connection?
How do you start planning for such a project?
You'll want to start here:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Which will take you to here:
http://download.wikimedia.org/enwiki/20100312/
And the file you probably want is:
# 2010-03-17 04:33:50 done Log events to all pages.
* This contains the log of actions performed on pages.
* pages-logging.xml.gz 1.0 GB
http://download.wikimedia.org/enwiki/20100312/enwiki-20100312-pages-logging.xml.gz
You'll then import the xml into MySQL. Generating a histogram of users per day, week, year, etc. won't require R. You'll be able to do that with a single MySQL query. Something like:
select DAYOFYEAR(wiki_edit_timestamp), count(*)
from page_logs
group by DAYOFYEAR(wiki_edit_timestamp)
order by DAYOFYEAR(wiki_edit_timestamp);
etc.
(I'm not sure what their actual schema is, but it'll be something like that.)
You'll run into issues, no doubt, but you'll learn a lot too. Good luck!
You could
work with the wikipedia database dumps, as already mentioned
work with the live mediawiki API, see this minimal example at Rosettacode or my unfinished approach with a S3 class or this package by Peter Konings
work with dbpedia, an effort to extract knowledge from wikipedia into a knowledge base. They offer an online sparql access I don't know much about, and also datasets as n-triples for download. See this python script which might be a starting point for an R script. This approach might be useful to access the content stored in the wikipedia (such as the infoboxes) but I am not sure if information on contributors to the wikipedia is available.
Try WikiXRay (Python/R) and zotero.

How do you maintain a library of useful SQL in a team environment?

At my work everyone has sql snippets that they use to answer questions. Some are specific to a customer, while some are generic for a given database. I want to consolidate those queries into a library/repository that can be accessed by anyone on the team. The requirements would be:
Accessible
Searchable
Tagable (multiple tags allowed per sql)
Exportable (create a document containing all queries with certain tags)
I'm interested in what has been found to work in other team environments.
You could use a wiki.
You could get started with something as simple as Tiddly wiki.
A wiki is a great approach.
For database specific or project specific snippets it's also very useful to have links to where a similar construct occurs in the code. We use trac's wiki which gives nice integration with out SVN for this.
Rather than pasting SQL snippets, I would consider graduating to an ORM (Object-Relational Mapper) or some other library to make representing and manipulating the data easier. It provides a layer of encapsulation to guard against schema changes and a layer of abstraction so you can think of the data in terms of business logic (ie. a user) rather than a collection of tables (ie. a user table, a password table, an access table...).
In Perl this would be something like DBIx::Class.
Another approach you may want to look at is creating views in your database. 'select * from some_view' can hide quite a bit of SQL. You'll still want to use a wiki to document them, but if its a view you don't have to worry about people keeping outdated copies.