NoSQL or SQL or other tools for scaling Excel spreadsheets

I am looking to convert an Excel spreadsheet into more of a scalable solution for reporting. The volume of the data is not very large: at the moment the spreadsheet has around 5k rows and grows by about 10 every day. There are also semi-frequent changes in how we capture information, i.e. new columns as we mature the processes. The spreadsheet just stores attribute or dimension data on cases.
I am just unsure whether I should use a traditional SQL database or a NoSQL database (or any other tool). I have no experience with NoSQL, but I understand that it is designed to be very flexible, which is what I want compared to a traditional DB.
Any thoughts would be appreciated :) !

Your dataset is really small and any SQL database (say, PostgreSQL) will work just fine. Stay away from NoSQL DBs as they are more limited in terms of reporting capability.
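If you do go the SQL route, a minimal sketch could look like the following; the table and column names are made up for illustration since the question doesn't list the actual attributes, and new attributes can be added later with ALTER TABLE:

    -- Minimal PostgreSQL sketch of a table holding case attributes.
    -- All names here are illustrative, not taken from the question.
    CREATE TABLE cases (
        case_id    serial PRIMARY KEY,
        opened_on  date NOT NULL,
        status     text NOT NULL,
        category   text,
        owner      text
    );

    -- As the process matures, new columns can be added without touching existing rows.
    ALTER TABLE cases ADD COLUMN region text;

At 5k rows growing by roughly 10 a day, no special tuning is needed for this to stay fast.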
However, since your fact schema is still not stable ("new columns as we mature the processes"), you may simply use your spreadsheet as a data source in BI tools. To keep your reports up to date you may use the following process:
Store your Spreadsheet on cloud storage (like Google Drive or OneDrive)
Use a codeless automation platform (like Zapier) to set up a job that syncs the spreadsheet file with the BI tool when it changes. This is easily possible if the BI tool is SeekTable, for instance.

Related

Most flexible way to store personal financial historical transaction/trade data

I'm not talking about time-series Open High Low Close data, but rather user trade actions that result in a transaction for accounting purposes. I am not proficient at databases and typically avoid using them, not because of difficulty but mostly due to efficiency. From my benchmarks on time-series data I found CSV files to be performant/reliable compared to various time-series storage formats like HDF, LMDB, etc. I know I didn't try all possible databases out there (for example LevelDB, SQL, etc.), but like I said, I am not proficient in databases and avoid the unnecessary overhead if I can help it.
Is it possible to enumerate the different ways accounting/financial data can be stored such that it's possible to easily insert, delete and update the data? Once the data has been aggregated it doesn't need to be touched, just traversed in order from past to future for accounting purposes. What methods are out there? Are databases my only option? CSV files would be a bit tougher here, since time-series data implies a period whereas transaction data is not predictable and happens as user events, so a single CSV file with all financial history is possible except that it's expensive to update/insert/delete. I could also recreate the CSV file each time I want to update/insert/delete - that is possible as well, but maybe not what I am looking for. Another idea is to do monthly or daily statements stored as CSV. I want to be able to compare my stored data against remote resources for integrity and update my stored financial data if there are any inaccuracies.
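For illustration only, the relational option mentioned above can be as small as a single transactions table; all names below are assumptions, not anything from the post:

    -- Sketch of one append-mostly transactions table (generic SQL).
    CREATE TABLE transactions (
        txn_id      INTEGER PRIMARY KEY,
        executed_at TIMESTAMP NOT NULL,  -- actual event time; irregular, unlike bar data
        account     TEXT NOT NULL,
        symbol      TEXT,
        quantity    NUMERIC NOT NULL,
        price       NUMERIC NOT NULL
    );

    -- Inserts, corrections and deletions are cheap row-level operations...
    INSERT INTO transactions VALUES (1, '2024-01-05 10:32:00', 'broker-a', 'AAPL', 10, 185.20);
    UPDATE transactions SET price = 185.25 WHERE txn_id = 1;

    -- ...and the past-to-future traversal for accounting is a single ordered scan.
    SELECT * FROM transactions ORDER BY executed_at, txn_id;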

Question on best practice for creating views that are consumed by visualization tools like PowerBI or Tableau

I've tried searching around to see what the best practices are when designing a view that will be used for visualization going directly into PowerBI or Tableau.
I don't know the best way to ask this, but is there an issue with creating a big query (30+ columns) with multiple joins in the DB for export into the visualization platform? I've seen some posts regarding size and about breaking things up into multiple queries, but those are also in reference to bringing the data into some program and writing logic in that program to do the joins, etc.
I have tried both ways so far: smaller views for which I then create relationships in PowerBI, or larger views where I'm dealing with just one flat table. I realize that in most respects PowerBI can handle a star schema with the data being brought in, but I've also run into weird filtering issues within PowerBI itself, which I have been able to alleviate and speed up by doing that work in the DB instead.
Database is a Snowflake warehouse.
Wherever possible, you should be using the underlying database to do the work that databases are good at, i.e. selecting/filtering/aggregating data. So your BI tools should be querying those tables rather than bringing all the data into the BI tool as one big dataset and then letting the BI tool process it.
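As a rough illustration (the schema, view and table names below are hypothetical, not from the question), the joins and filtering can live in a database view that the BI tool then reads as one flat result set:

    -- Hypothetical Snowflake view that does the joins/filtering in the database,
    -- so PowerBI or Tableau only pulls a flat, ready-made result set.
    CREATE VIEW reporting.sales_flat AS
    SELECT
        o.order_id,
        o.order_date,
        c.customer_name,
        c.region,
        p.product_name,
        o.quantity * o.unit_price AS line_amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id
    WHERE o.order_date >= DATEADD(year, -2, CURRENT_DATE);  -- limit to the window the report needs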

BI Data modeling - Traditional vs new approaches

Dear community,
I hope the headline gives you a hint of what I want to talk about / need advice on.
I'm a BI Developer with 3 years of experience working on big BI projects - some in the health industry and some in the finance industry when I was working at IBM.
In my current job I joined a startup; the company has an operational DB for the product, and the data is in a SQL Server DB.
For 4 months I was putting out fires caused by the mess my predecessor left, and now I'm ready for the next step: modeling the operational DB tables into a DWH to be able to extract and use the data for analytical and BI purposes.
I don't have any resources at all, so I will build the DWH on the operational DB first; my vision is that the DWH will be on Snowflake after I get resources from my CTO.
The modeling issue:
When tackling data modeling, I ran into some confusion about the right way to model the data: there is the traditional way I'm familiar with from IBM, but there are also the cloud DWH modeling approach and the hybrid approach.
My model needs to be flexible and the data should be extracted very fast.
What is the best way to store and extract data for analytical purposes?
Fact tables with a lot of dimensions - the normalized approach
OR
Putting all the data I need at the same granularity into the same table (thinking about the future move to Snowflake), so I will have several tables, each with its own granularity and its own world.
I'm just interested to hear what some of you have implemented at your companies and whether you have advice or a use case you can share. I've searched the web a lot and what I found is a lot of biased and very confusing info - nobody really says what works in the real world.
Thanks in advance!
Well, two key points of normalisation are to reduce the disk space used and to optimise data retrieval, neither of which is all that relevant in Snowflake. Storage is dirt cheap. And for the most part, the database is self-optimising - worst case, you might have to set up clustering keys on very large tables (see: https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html).
I've found that big tables with lots of columns perform better than many smaller tables with joins. For example, when testing on a flat table with 10 million rows with a clustering key set up, it was about 180% faster than obtaining the same result set from a more complex, multi-table model.
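For reference, the flat-table setup with a clustering key is only a couple of statements in Snowflake; the table and columns below are invented for the example, and the clustering key is only worth adding on very large tables:

    -- Illustrative Snowflake flat table with a clustering key.
    CREATE TABLE analytics.orders_flat (
        order_date    DATE,
        region        VARCHAR,
        customer_name VARCHAR,
        product_name  VARCHAR,
        quantity      NUMBER,
        line_amount   NUMBER(18,2)
    )
    CLUSTER BY (order_date, region);  -- pick columns that match common filter predicates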
If you're anticipating a lot of writeback and require object-level changes, then you should still consider normalisation - but you'd be better off with a star schema in that case.

How to mix an RDBMS with a Graph DB

I am developing a website using Django and PostgreSQL which would seemingly have a huge amount of data, like that gathered on social network sites.
I need to use an RDBMS with SQL for tabular data with lower query complexity, and also a graph DB with Cypher for large data with high query complexity.
Please let me know how to go about this. Also please let me know whether it is feasible.
EDIT: Clarification, as asked for in the comments:
The database structure can be similar to that of a social network like Facebook. I've checked the FB Engineering page for their Open Graph. For graph DBs I can only find Neo4j with proper ACID properties, though I would prefer an open-source graph DB. I basically need the graph DB structure for summarising a huge volume of data pertaining to relationships like friends, updates, and daily user-related updates as individual relations. Horizontal scalability is important to me for future upgrades.
I intend to use PostgreSQL for the base informational data and push the relational data updates to the graph DB, much like Facebook uses both MySQL and Open Graph.
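(For context, the kind of relationship query the asker wants to offload to a graph database looks roughly like this in plain PostgreSQL; the friendships table and the three-hop limit are assumptions made up for the sketch, and each extra hop multiplies the join work, which is why a graph DB is attractive here.)

    -- Hypothetical friends-of-friends query in PostgreSQL, up to three hops.
    CREATE TABLE friendships (
        user_id   BIGINT NOT NULL,
        friend_id BIGINT NOT NULL,
        PRIMARY KEY (user_id, friend_id)
    );

    WITH RECURSIVE reachable AS (
        SELECT friend_id, 1 AS depth
        FROM friendships
        WHERE user_id = 42                 -- starting user
        UNION
        SELECT f.friend_id, r.depth + 1
        FROM friendships f
        JOIN reachable r ON f.user_id = r.friend_id
        WHERE r.depth < 3                  -- each additional hop adds another self-join
    )
    SELECT DISTINCT friend_id FROM reachable;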
Based on your reply to my queries, I would first suggest looking at TitanDB. I believe it fulfills many of your requirements:
It is open source.
It scales horizontally.
In addition to meeting your requirements, it has existed for quite some time and many companies are using it in production. The only thing you would have to get used to is that it uses TinkerPop traversals, not Cypher queries. Also note that I believe Titan is not ACID for most backends; this is a result of it being horizontally scalable.
If you would like a more structured (but significantly less mature) approach to graph DBs, then you can look at the stack that some colleagues and I are working on, MindmapsDB, which sits on top of Titan but uses a more "SQL-like" query language.
OrientDB Gremlin is also a very good option but lacks the maturity and support of Titan.
There are many other graph vendors out there, such as DSE Graph, IBM Graph, etc., but the ones I have listed above are the open-source ones I have worked with.

Alternatives to Essbase

I have Essbase as the BI solution (for predictive analytics and data mining) in my current workplace. It's a really clunky tool, hard to configure and slow to use. We're looking at alternatives. Any pointers as to where I can start?
Is Microsoft Analysis Services an option I can look at? SAS or any others?
Essbase's focus and strength is in the information management space, not in predictive analytics and data mining.
The top players (and the expensive ones) in this space are SAS (with the Enterprise Miner & Enterprise Guide combination) and IBM with SPSS.
Microsoft SSAS (Analysis Services) is a lot less expensive (it's included with some SQL Server versions) and has good Data Mining capabilities but is more limited in the OR (operations research) and Econometrics/Statistics space.
Also, you could use R, an open-source alternative that is growing in popularity and capability over time; for example, some strong BI players (SAP, MicroStrategy, Tableau, etc.) are developing R integration for predictive analytics and data mining.
Check www.kpionline.com; it is a cloud-based product from Artus.
It has many prebuilt dashboards, scenarios and functions for doing analysis.
Another tool you could check is MicroStrategy. It has many functions for analysis.