How to turn a MongoDB collection into a 'Table'

I've been given access to a cloud MongoDB (MongoLab) and need to extract some data into Excel so I can analyse it. The data isn't particularly complicated or large and is well suited to a 'normal' relational structure.
My research suggests things are trickier because the data has 'nested' aspects, although conceptually it's pretty clear how this would become a table. Here is what a document in the collection looks like: essentially the stuff highlighted blue would be columns in the table, while the yellow would create a row for each "marketing_event", with the specifics of each event also being in a column:
Ideally I would use Power Query to get the data into Power Pivot but at this point anything will do!
I've tried a bunch of things, none of which has got me much closer to the end result I'm looking for:
I downloaded MongoVue, which I used to successfully connect to the database. While it enabled me to see the data in a basic table form, it does nothing with the nested stuff, and the documentation is minimal in terms of how it could be of more use.
I also tried Pentaho PDI based on this article: http://sqlmag.com/blog/integrating-mongodb-and-open-source-data-stores-power-pivot but the steps aren't detailed, and although I can see the collection, trying to replicate some sample queries I found on the web was totally unsuccessful.
I've tried to get a trial of Simba's ODBC connector, but as yet the download doesn't seem to be working. I have contacted them but haven't had a response just yet.
I've even installed Mongo locally and tried to use the command prompt to connect, which I was unable to do. Even if I pursued this, I wouldn't be confident about knowing where to start in terms of creating the end product.
Happy to hear any suggestions or recommendations.
TIA
Jacob

Here's a solid ODBC driver that helps maintain the fidelity of your MongoDB data by exposing the nested MongoDB data model as a set of relational tables to Excel and other ODBC apps. In the sample document above, this driver will do exactly what you're looking for: the embedded documents and arrays can be extracted as separate related tables from the fields at the root level of the document.
https://www.progress.com/odbc/mongodb
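If you'd rather script the flattening yourself than rely on a driver, a minimal sketch with pymongo and pandas could look like the following; the field names (root-level customer fields plus a "marketing_events" array) are assumptions standing in for the screenshot, so adjust them to your actual documents:

    # Flatten nested documents into one row per marketing event,
    # then export to CSV for Excel / Power Pivot.
    import pandas as pd
    from pymongo import MongoClient

    client = MongoClient("mongodb://user:password@host:port/dbname")  # your MongoLab URI
    docs = list(client["dbname"]["customers"].find())

    # One row per element of the nested array; root-level fields repeat on each row.
    df = pd.json_normalize(
        docs,
        record_path="marketing_events",   # assumed nested array field (the "yellow" rows)
        meta=["_id", "name", "email"],    # assumed root-level fields (the "blue" columns)
        record_prefix="event.",
    )

    df.to_csv("customers_flat.csv", index=False)  # opens directly in Excel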

I don't know if you already found a solution, but Simba's ODBC driver provides support for nested arrays.
Have a look here:
https://www.simba.com/resources/webinars/connect-tableau-big-data-source. This is an example of how to connect Tableau BI to MongoDB. You might find it helpful.
And some more information on handling NoSQL data in BI tools is provided in this whitepaper: http://info.mongodb.com/rs/mongodb/images/MongoDB_BI_Analytics.pdf

Related

Best way to save SQL versions while working on Tableau?

I am working in Tableau and have to write multiple different SQL queries each time I create new data sources.
I have to save all the SQL changes for every data source.
Currently I paste the SQL into Notepad and save the files in a separate folder on my computer, along with a description of the changes.
Is there any better way to do this?
Assuming you have permission to create objects in the database, begin by creating database views, as @Nick.McDermaid commented.
Then, instead of using Custom SQL data source in Tableau, just connect to the View as if it were a table.
If you need to track the changes to these SQL views of your data, you will need to learn how to use source control for the .sql files that can be scripted from within SQL Server Management Studio:
Your company or school may have a preferred source control system already in use, in which case you should use that. If they don't, or if you are learning at home, then Git and Subversion are popular open source choices.
There are many courses available on learning platforms like Coursera that will teach you how to use those systems.
I had a similar problem.
We ended up writing the queries in the SQL editor SQL Workbench (https://www.sql-workbench.eu/), then managed the code history and performed peer code review (logic, error checks, etc.) in a shared team space (like Confluence).
The reasons we did this are:
1) SQL queries are much easier to write in SQL Workbench.
2) Code review is a must! Implementing a review process will surface more mistakes than you could ever imagine.
3) The shared space is just really convenient, as it is accessible by everyone, and all errors are documented. After some time you accumulate a lot of visible knowledge.
I also totally agree with Nick that this is one step towards a reporting solution. But developing a whole reporting server is heavy, costly and takes time. Unless management is really convinced of the importance of developing a reporting solution, you may have to make do with a workaround using queries and Tableau (at least that was the case for us).
A little late to the party, but I would suggest you simply version the Tableau workbook. The contents of the workbook are XML, so perfect for versioning using file-based tools (Dropbox, OneDrive, etc.) or source control (git, etc.). The workbooks themselves are usually quite small, so just make sure to keep the extract data separate if you use it.

Is there a way to let non-technical individuals utilize BigQuery reports?

I want to have an access portal for non-tech-savvy individuals in which they could make reports of their own without needing to know SQL whatsoever.
It would be best if I could create custom fields myself, and then just let the users in the portal pick and choose whichever they like with a custom date range.
I've explored the options Google Data Studio offers, but it looks to me like it mostly puts an emphasis on data visualization.
In addition, my attempts to make custom queries with it were not successful, since the platform is rigid in terms of deciding which field is a metric and which is a dimension (and it does so inaccurately). This makes it hard to query reports as you normally would using BigQuery, which doesn't have these somewhat arbitrary limitations.
Perhaps I've misunderstood something about the platform due to my limited experience with it, but it looks like Data Studio isn't going to fit the bill for me.
EDIT: In addition, the platform should have a way of exporting said reports as CSV files, a feature that Data Studio doesn't have as far as I know.
It would be great to receive suggestions for a different platform which would better fit my needs, or even suggestions on how to make better use of Data Studio.
Have you looked at using a tool like redash (https://redash.io)? Assuming your GA360 data is in BigQuery, you can connect redash to BQ. Then you can author queries and visualize.
You can also use the Google Cloud SDK to connect to BQ and run custom queries to generate new tables in BQ based on the GA360 session data. Then use redash, or any tool, to report/visualize.
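As a rough sketch of that second approach, materializing a custom query into a new table could look like this with the google-cloud-bigquery Python client (the project, dataset and table names here are placeholders):

    # Run a custom query over the GA360 session data and write the result
    # to a new table that redash (or any other tool) can then report on.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.QueryJobConfig(
        destination="my-project.reporting.daily_visitors",  # hypothetical table
        write_disposition="WRITE_TRUNCATE",  # overwrite the table on each run
    )

    query = """
        SELECT date, COUNT(DISTINCT fullVisitorId) AS visitors
        FROM `my-project.ga360.ga_sessions_*`
        GROUP BY date
    """
    client.query(query, job_config=job_config).result()  # wait for the job to finish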

I need advice choosing a NoSQL database for a project with a lot of minute related information

I am currently working on a private project that is going to use Google's GTFS spec to get information about hundreds of public transit agencies, their routes, stations, times, and other related information. I will be getting my information from here and the Google Code wiki page with similar info. There is a lot of data and it's partitioned into multiple CSV-formatted text files. These can be huge, some ranging from 80-100 MB of data.
With the data I have, I want to translate it all into a nice solid database that I can build layers on top of to use for my project. I will be using GPS positioning to pinpoint a location and all surrounding stations/stops.
My goal is to access all the information for all these stops and stations with as few calls as possible, while keeping datasets small for queried results.
I am currently leaning towards MongoDB and CouchDB for their geospatial support, which can really optimize getting small datasets. But I also need to be sure to link all the stops on a route, because I will be propagating information along a transit route for that line. In this case I have found that I could benefit from a graph DB like Neo4j or OrientDB, but from what I know, neither has geospatial support, nor am I 100% sure that a graph DB would be what I need.
The perfect solution might not exist, but I come here asking for help finding the best possible one for my situation. I know I will possibly have to work around limitations of whatever I choose, but I want to at least have done my research and know that it's the best I can get at the moment.
It has also been suggested that I splinter the data into multiple DBs, but that could get very messy, because all the information is very tightly interconnected through IDs.
Any help would be appreciated.
Obviously a graph database fits your problem 100%. My advice here is to go for a geospatial module on top of Neo4j or OrientDB, although there are other free and open source implementations.
I think the best one right now, with all the geospatial features implemented, is the neo4j-spatial package. But as far as I know, you can also reproduce most of the geospatial functionality on your own if necessary.
By the way, talking about splitting: if the amount of data/queries will be high, I strongly recommend you share the load and design the model with that in mind. Surely you can work something out.
I've used Mongo's geospatial features and can offer some guidance if you need help with a C# or JavaScript implementation - I would recommend it to start with because it's super easy to use. I'm learning all about Neo4j right now and am working on a hybrid approach that takes advantage of both Mongo and Neo4j. You might want to cross-reference the documents in Mongo to the nodes in Neo4j using the Mongo object id.
For my hybrid implementation, I'm storing profiles and any other large static data in Mongo. In Neo4j, I'm storing relationships like friend and friend-of-friend. If I wanted to analyze movies two friends are most likely to want to watch together (or really any other relationship I hadn't thought of initially), by keeping that object id reference I can simply add some code instructing each node go out and grab a list of movies from the related profile.
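To make the Mongo side concrete, here is a minimal pymongo sketch of a nearby-stops lookup; the collection and field names are invented for illustration, not taken from the question:

    # Index GeoJSON points with a 2dsphere index, then query by proximity.
    from pymongo import MongoClient, GEOSPHERE

    db = MongoClient()["transit"]
    db.stops.create_index([("location", GEOSPHERE)])

    db.stops.insert_one({
        "stop_id": "S1",
        "name": "Main St Station",
        "location": {"type": "Point", "coordinates": [-73.97, 40.77]},  # lon, lat
    })

    # All stops within 500 meters of a given GPS position.
    nearby = db.stops.find({
        "location": {"$near": {
            "$geometry": {"type": "Point", "coordinates": [-73.968, 40.771]},
            "$maxDistance": 500,  # meters
        }}
    })
    for stop in nearby:
        print(stop["name"])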
Added 2011-02-12:
Just wanted to follow up on this "hybrid" idea as I created prototypes for and implemented a few more solutions recently where I ended up using more than one database. Martin Fowler refers to this as "Polyglot Persistence."
I'm finding that I am often using a combination of a relational database, document database and a graph database (in my case this is generally SQL Server, MongoDB and Neo4j). Since the question is related to data modeling as much as it is to geospatial, I thought I would touch on that here:
I've used Neo4j for site organization (similar to the idea of hypermedia in the REST model), modeling social data and building recommendations (often based on social data). As a result, I will generally model this part of the application before I begin programming.
I often end up using MongoDB for prototyping the rest of the application because it provides such a simple persistence mechanism. I like to start developing an application with the user interface, so this ends up working well.
When I start moving entities from Mongo to SQL Server, the context is usually important - for instance, if I have an application that allows users to build daily reports based on periodically collected data, it may make sense to run a procedure that builds those reports each night and stores daily report objects in Mongo that may be combined into larger aggregate reports as needed (obviously this doesn't consider a few special cases, but that is not relevant to the point)...on the other hand, if users need to pull on-demand reports limited to very specific time periods, it may make sense to keep everything in SQL server and build those reports as needed.
That said, and this deserves more intense thought, here are some considerations that may be helpful:
I generally try to store entities in a relational database if I find that pulling an entity from the database (that is, in the relational context, querying the data required to generate an entity or a list of entities that fulfills the requested parameters) does not require significant processing (multiple joins, for instance).
Do you require ACID compliance? (Aside: if you have a graph problem, you can leverage Neo4j for this.) There are document databases with ACID compliance, but there's a reason Mongo is not: What does MongoDB not being ACID compliant really mean?
One use of Mongo I saw in the wild that I thought was worthy of mention - Hadoop was being used to compute massive hash tables that were then stored in Mongo. I believe a similar approach is used by TripAdvisor for user based customization in terms of targeting offers, advertising, etc..
NoSQL only exists because MySQL users assume that all databases share MySQL's performance problems when the database grows large and/or becomes complex.
I suggest that you use PostGIS. You can use the same database for the rest of your data needs as well.
http://postgis.refractions.net/
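As a hedged illustration, the same nearby-stops query could be done in PostGIS via psycopg2 like this (the table and column names are assumptions):

    import psycopg2

    conn = psycopg2.connect("dbname=transit user=postgres")
    cur = conn.cursor()

    # ST_DWithin on geography values takes the radius in meters.
    cur.execute(
        """
        SELECT stop_id, name
        FROM stops
        WHERE ST_DWithin(
            geom::geography,
            ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography,
            %s
        )
        """,
        (-73.968, 40.771, 500),  # lon, lat, radius in meters
    )
    for stop_id, name in cur.fetchall():
        print(stop_id, name)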

Schema-agnostic OLAP-like tool?

Do there exist any (ideally free or open-source) tools for performing OLAP analyses on arbitrary tables in a relational database, without requiring any advance specification of dimensional hierarchies, cardinalities, or any other meta-information about the table beyond what can be extracted automatically from the table itself?
My inability to Google for anything like what I'm describing makes me suspect I'm using incorrect terminology and what I'm searching for isn't properly considered to be OLAP. If this is the case, what I specifically want is anything that would let technically unsophisticated users create cross-tab or contingency table aggregations using tables in a relational DB without needing to write elaborate SQL queries.
Or, in other words, I'd like something that mimics Excel's PivotTables on a larger scale. I appreciate that Excel does indeed generate extensive caches behind the scenes when you make a PivotTable, but it does this without the user having to explain to it which caches need creating. This is the functionality I'm trying to find elsewhere, if it exists.
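To make the desired behaviour concrete, here is roughly the kind of ad-hoc pivoting I mean, sketched in pandas (the table and column names are invented; this illustrates the functionality, it is not a suggestion that end users write Python):

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:pass@localhost/sales_db")
    df = pd.read_sql_table("orders", engine)  # nothing needed beyond the table itself

    # Any column can serve as a dimension or a measure, decided at query time.
    pivot = pd.pivot_table(
        df,
        index="region",      # row dimension
        columns="product",   # column dimension
        values="amount",     # measure
        aggfunc="sum",
    )
    print(pivot)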
The best options I know of are Excel and Access, but of course they are not open source. This space kinda got trampled in the explosion of interest in what is now called Business Intelligence and a lot of companies got bought by MS and others. It's pretty thin now as far as I can tell. I'll watch this thread though.
The most useful paradigm to attach to is, I think, spreadsheets, and there's not much competition there any more. Google Docs spreadsheets can import CSV etc. exported from databases, and there's a pivot chart available, but not much more.
The other place I've seen OLAP capabilities is in the Adobe Flex libraries to build on with ActionScript if you have any inclination in that direction. As usual, Adobe manages to get it about 90% right but doesn't quite provide a whole product.
icCube aims to set up an OLAP cube as simply as possible. It is not schema-agnostic, but I guess it is quite simple to define dimensions and facts from existing DB tables. Nevertheless, this could be not so "simple" depending on your tables - difficult to say without knowing them. I guess there's no generic easy solution ;-)
Then you can use an Excel pivot table (amongst others) to access the cubes. Note that, as far as I know, Excel does neither caching nor aggregation when connecting to a cube; it generates all the required MDX requests against the cube.
Hope that helps.

Wiki Database, is there one?

I was searching the net for something like a wiki database - just like Wikipedia, but storing structured content instead, editable by users. What I was looking for was an online database accessible by everyone, where people can design the schema and the data, with proper versioning of both schema and data. I couldn't find any such site. I am not sure if it is my search skills or if there really is no wiki database as of now. Does anyone out there know anything like this?
I think there is a great potential for something like this. A possible example will be a website with a GUI for querying a MySQL DB where any website visitor can create DB objects and populate data.
UPDATE: I had registered the domain wikidatabase.org to get started on a tool but I didn't find enough time yet. If anyone is interested in spending some time and coding on this, please let me know at wikidatabase.org
It's not quite what you're looking for, but Semantic Mediawiki adds database-like features to MediaWiki:
http://semantic-mediawiki.org/wiki/Semantic_MediaWiki
It's still fundamentally a Wiki, but you can add semantic tags to pages ([[foo::bar]] [[baz::1000]]) and then do database-type queries across them: SELECT baz FROM pages WHERE foo=bar would be {{#ask: [[foo::bar]] | ?baz}}. There is even an embryonic SPARQL implementation for pseudo-SQL queries.
OK this question is old, but Google led me here, so for anyone else out there looking for a wiki for structured data: Take a look at Foswiki.
This might be like what you're looking for: dbpedia.org. They're working on extracting data from Wikipedia, and encoding it in a structured format using RDF, so that it can be queried using SPARQL.
Linkeddata.org has a big list of RDF data sets.
Do you mean something like http://www.freebase.com?
You should check out https://www.wikidata.org/wiki/Wikidata:Main_Page which is a bit different but still may be of interest.
Something that might come close to your requirements is Google Docs.
What's offered is document editing roughly similar to MS Word, and spreadsheets roughly similar to Excel. I'm thinking of the latter, of course.
In Google Docs, you can create spreadsheets for free; being spreadsheets, they naturally have a row-and-column structure similar to a database, which you can define flexibly. You can also share these sheets with other people. This seems to be a by-invite-only process rather than open-to-all, but there may be other possibilities I'm not aware of, or that level of sharing might be enough for you in any case.
MindTouch should be able to do it. It's rather easy to get data in/out (for example, it's trivial to aggregate all the IPs for servers into one table).
I pretty much use it as a DB in the wiki itself (pages have tables, key/values, inheritance, templates, etc.), but you can also interface with the API, write DekiScript, grab the XML...
I like this idea. I have heard of some sites that are trying to pull together large datasets for various things for open consumption, but none that would allow a wiki feel.
You could start with something as simple as an installation of phpMyAdmin with a known password that would allow people to log in, create a database, edit data and query from any other site on the web.
It might suffer from more accuracy problems than Wikipedia, though.
OpenRecord, development of which seems to have halted in 2008, seems to approach this. It is a structured wiki in which pages are views on the data. Unlike RDBMSes, it is loosely typed: the system tries to make a best guess about what data you entered, but defaults to text when it cannot guess. Schemas appear to be implied rather than declared.
http://openrecord.org
An example of the typing that is given is that of a date. If you enter '2008' in a record, the system interprets this as a date. If you enter 'unknown' however, the system allows that as well.
Perhaps you might be interested in CouchDB:
Apache CouchDB is a document-oriented database that can be queried and indexed in a MapReduce fashion using JavaScript. CouchDB also offers incremental replication with bi-directional conflict detection and resolution.
I'm working on an Open Source PHP / Symfony / PostgreSQL app that does this.
It allows multiple projects; each project can have multiple directories, and each directory has a defined field structure. Admins set all this up.
Then members of the public can suggest new records, edit or report existing ones. All this is moderated and versioned.
It's early days yet but it basically works and is already in real world use in several projects.
Future plans already in progress include tools to help keep the data up to date, better searching/querying and field types that allow translations of content between languages.
There is more at http://www.directoki.org/
I'm surprised that nobody has mentioned Wikibase yet, which is the software that powers Wikidata.