Question on best practice for creating views that are consumed by visualization tools like PowerBI or Tableau

I've tried searching around to see what the best practices are when designing a view that will be consumed directly by Power BI or Tableau for visualization.
I don't know the best way to ask this, but is there an issue with creating a big query (30+ columns, multiple joins) in the DB for export into the visualization platform? I've seen some posts regarding size and about breaking it up into multiple queries, but those are in reference to bringing the data into some program and writing logic in that program to do the joins etc.
I have tried both ways so far: smaller views between which I then create relationships in Power BI, or larger views where I'm dealing with just one flat table. I realize that in most respects Power BI can handle a star schema with the data being brought in, but I've also run into weird issues with filtering within Power BI itself, which I have been able to alleviate and speed up by doing that work in the DB instead.
Database is a Snowflake warehouse.

Wherever possible, you should be using the underlying database to do the work that databases are good at, i.e. selecting/filtering/aggregating data. So your BI tools should be querying those tables rather than bringing all the data into the BI tool as one big dataset and then letting the BI tool process it.
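As a concrete sketch of what that looks like in a Snowflake warehouse (all table, view, and column names here are hypothetical), the filtering and aggregation can live in a view so that Power BI or Tableau only pulls the summarised result:

    -- Hypothetical example: aggregate in the warehouse, not in the BI tool.
    CREATE OR REPLACE VIEW reporting.v_monthly_sales AS
    SELECT
        c.region,
        DATE_TRUNC('month', o.order_date)  AS order_month,
        SUM(o.net_amount)                  AS total_sales,
        COUNT(DISTINCT o.order_id)         AS order_count
    FROM sales.orders o
    JOIN sales.customers c
      ON c.customer_id = o.customer_id
    WHERE o.order_date >= DATEADD(year, -3, CURRENT_DATE)  -- only the history the reports need
    GROUP BY c.region, DATE_TRUNC('month', o.order_date);

The BI tool then just selects from reporting.v_monthly_sales instead of joining and aggregating the raw tables itself.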

Related

NoSQL or SQL or Other Tools for scaling excel spreadsheets

I am looking to convert an Excel spreadsheet into more of a scalable solution for reporting. The volume of data is not very large: at the moment the spreadsheet is around 5k rows and grows by about 10 rows every day. There are also semi-frequent changes in how we capture information, i.e. new columns as we start to mature the processes. The spreadsheet just stores attribute or dimension data on cases.
I am just unsure whether I should use a traditional SQL database or a NoSQL database (or some other tool). I have no experience with NoSQL, but I understand that it is designed to be very flexible, which is what I want compared to a traditional DB.
Any thoughts would be appreciated :) !
Your dataset is really small and any SQL database (say, PostgreSQL) will work just fine. Stay away from NoSQL DBs as they are more limited in terms of reporting capability.
However, since your fact schema is still not stable ("new columns as we start to mature the processes"), you may simply use your spreadsheet as a data source in your BI tool. To keep your reports up to date you may use the following process:
Store your spreadsheet in cloud storage (like Google Drive or OneDrive)
Use a codeless automation platform (like Zapier) to set up a job that syncs the spreadsheet file with the BI tool when it changes. This is easily possible if the BI tool is SeekTable, for instance.
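If the data does eventually get moved into PostgreSQL as suggested above, a minimal sketch (table and column names are made up for illustration) shows how little is needed at this volume, and that new columns can still be added as the process matures:

    -- Hypothetical single table for ~5k rows of case attribute data.
    CREATE TABLE cases (
        case_id    serial PRIMARY KEY,
        opened_on  date NOT NULL,
        status     text NOT NULL,
        owner      text,
        category   text
    );

    -- A new attribute later on is a one-line migration.
    ALTER TABLE cases ADD COLUMN priority text;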

Where does SQL stop and data modelling in Power BI start?

I am creating a dataset in Power BI Desktop using data in a SQL Server database. I have read the sqlbi article that recommends using database views, but I would like to know: how do I structure them?
There is already a view of the database that contains most of the information I need. This view consists of 7 or 8 other, more basic views (mainly 2-column tables with keys and values), combined using left joins. Do I import the larger view as a flat table, or each of the smaller views, creating relationships etc. (ideally in a star schema) in Power BI?
I guess conceptually I am asking: where does the SQL stop and Power BI start when it comes to creating and importing views?
where does the SQL stop and Power BI start when it comes to creating and importing views?
Great question. No simple answer. Typically modeling in Power BI is quicker and easier than modeling in the database. But modeling in the database enables DirectQuery, and is more useful outside of Power BI.
Sometimes it boils down to who is creating the model. The "data warehouse" team will tend to create models in the database first, either with views or tables. Analysts and end-users tend to create models directly in Power BI.
Sometimes it boils down to whether the model is intended to be used in multiple reports or not.
There is no one-size-fits-all approach here.
If your larger view already has what you need and you need it for just a one-off report, then you can modify it to add the additional fields (data points), weighing that against the effort needed to create a proper schema.
The decision whether you should import the smaller views and connect them as a star schema (assuming they form a fact table surrounded by dimension tables) depends on whether you are going to use that model in a lot of other reports where the data is connected, i.e. giving you the same level of information in every report.
Creating views also depends on a lot of other factors: are you querying a reporting snapshot (or read replica) of your prod database, or the actual production database? This might restrict you or affect the choice between views and materialized views.
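As a rough illustration of the "model it in the database" option (SQL Server flavour; every object and column name below is invented), the existing combined view could be complemented by one fact view plus a few dimension views, which Power BI then imports and relates as a star schema:

    -- Hypothetical dimension view built on one of the small key/value base views.
    CREATE VIEW rpt.DimProduct AS
    SELECT ProductKey, ProductName, Category
    FROM dbo.vw_Products;
    GO
    -- Hypothetical fact view exposing only keys and measures.
    CREATE VIEW rpt.FactSales AS
    SELECT SalesKey, ProductKey, CustomerKey, OrderDate, SalesAmount
    FROM dbo.vw_Sales;

In Power BI you would then import rpt.FactSales and the rpt.Dim* views and create one-to-many relationships on the key columns.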

How to prepare a BigQuery database to best test different BI systems?

We have a database in BigQuery that we want to prepare in order to test different BI systems (Looker, Chartio, BIME, ...).
How should we organize the database?
Flatten and normalize
As a first step we have created normalized and flattened views that we intend to use for BI.
Combine tables into a huge view
As a second step we have considered creating one huge view that cross joins all the normalized views.
Test BI systems
We intend to initially use this huge view as data source for our tests of BI systems.
Split huge view to improve performance
When we have selected a BI system we intend to create smaller views instead of the huge view to support each dashboard we end up building, to improve speed.
Does this sound like a good approach? Would a different approach be better?
For BigQuery it makes sense to denormalize the data. Joins are (relatively) expensive. BigQuery is very fast when querying a single table, even if it has >200 denormalized columns. Most of these tools will generate queries on the fly, so if you query a view all the time it will have to do lots of joins for each visualization, and you'll end up waiting a long time.
If you have to work with this normalized data, please do all the joins once and "materialize" it by creating a permanent table out of it, and build all your visualizations on top of that.
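A sketch of that materialisation step in BigQuery standard SQL (dataset, table, and column names are invented for illustration): do the joins once, persist the wide result, and point every visualization at the resulting table.

    -- Hypothetical example: join the normalized views once and persist
    -- the denormalized result as a single wide table.
    CREATE OR REPLACE TABLE analytics.wide_orders AS
    SELECT
      o.order_id,
      o.order_date,
      c.customer_name,
      c.country,
      p.product_name,
      p.category,
      o.quantity,
      o.amount
    FROM analytics.orders o
    JOIN analytics.customers c ON c.customer_id = o.customer_id
    JOIN analytics.products  p ON p.product_id  = o.product_id;

The table can be recreated on a schedule if the underlying data changes, while the BI tools only ever scan the single denormalized table.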

ETL - Views or persist tables?

When building a Data Warehouse I usually see two main approaches for the ETL-process:
1. View - View of views - View of views of views - ...
Approach one is obviously done in the database and has the advantage that you don't have that much redundant data, but it could lead to performance issues.
2. Stage table (copy of data) - clear table (copy of data) - dwh table (copy of data) - ...
Approach two could be done with many tools, such as stored procedures and jobs, or an ETL tool like SSIS.
The advantage here is that it's easy to understand the process, as you can visualize it pretty well. You usually also get very good overall ETL performance, many predefined tasks, etc.
A problem could be, for example, that changing the process is more complex, as the persisted tables have to be changed.
In the real world you usually see a mix of both, especially when many people have worked on the process.
Of course it also depends on the situation (size of the tables, how similar processes are designed in this company, how complex the ETL process is, ...).
I personally prefer to copy tables, keep the ETL process simple and, if possible, do everything in the ETL tool (usually SSIS in my case), which is designed for this purpose.
But what is best practice and why?
Views and views of views would not scale with the volume of data in a DWH. When it comes to a DWH, we are talking about huge volumes of data, and integration of data from multiple sources is a common use case. Stage -> transform -> fact/dim is one of the most common ways a DWH is built to store data. Yes, this would change somewhat when we talk about HDFS and other technologies, but views would not be able to give you the desired performance in a DWH.
I have seen many systems, and all of them have a multi-step ETL process where you first get data into the DWH from the sources and then clean/process/conform/transform this data via ETL (or other tools) into your dimensional or other model.
If you want to know point-in-time reference data relationships, implemented in a dimensional DW as type-2 or type-3 slowly changing dimensions, you probably won't find this in a source system.
The scale issue mentioned by garpitmzn is not just about data volumes, but also the joins necessary to restructure and denormalise the data for dimensional analysis. Using views (unless materialised) you'd repeat potentially complex joins for every query. Better to do it once at the time the dimension is loaded.
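To make the stage -> transform -> fact/dim flow above concrete, here is a heavily simplified sketch (T-SQL flavour, all object and column names invented); the point is that the conforming joins are paid once per load, not in every report query against a view of views:

    -- 1) Land the raw source data unchanged in a staging table.
    INSERT INTO stage.customer_raw (customer_id, customer_name, region_code)
    SELECT customer_id, customer_name, region_code
    FROM src.customers;

    -- 2) Transform/conform once per load into the dimension table
    --    (surrogate key assumed to be an IDENTITY column on dim_customer).
    INSERT INTO dwh.dim_customer (customer_id, customer_name, region_name)
    SELECT s.customer_id, s.customer_name, r.region_name
    FROM stage.customer_raw s
    JOIN stage.region_raw r
      ON r.region_code = s.region_code;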

SSAS cube from a flat table

I'm trying to figure out if one can build an SSAS cube quickly for prototyping from just one huge and wide table, without doing any ETL or custom SQL. Is it even possible?
What we are trying to do: we have a bunch of these tables for different subject areas which were denormalized, and a lot of effort was put into creating and testing them. We need a quick way to access this data now and run analytical queries, but before we spend time on ETL/dimensional design, we wanted to build a quick cube.
Please do not suggest PowerPivot or any other in-memory tools - these tables are really big and we have very limited RAM at our disposal.
Yes, it's possible. Simply use the same table for creating both dimensions and cubes (measure groups). It's not ideal to do it like this for production, but you should be fine for prototyping.
Another alternative I always use in situations like this: create SQL views on top of the wide table to mimic the dimensions and facts (a dimensional model), and use those views in the data source view. If you have time to spend on creating the views, this is the best method, because at the end of the prototype you know the model and functionality are working, and you just need to create the physical data warehouse and ETL when you're ready to implement it in production.
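For example, a minimal sketch of such views over a single wide table (all names hypothetical): one dimension view built with SELECT DISTINCT and one fact view exposing the keys and measures, both of which go into the SSAS data source view.

    -- Hypothetical dimension view carved out of the wide table.
    CREATE VIEW dbo.vw_DimCustomer AS
    SELECT DISTINCT CustomerId, CustomerName, CustomerSegment
    FROM dbo.WideSalesTable;
    GO
    -- Hypothetical fact view over the same wide table.
    CREATE VIEW dbo.vw_FactSales AS
    SELECT CustomerId, OrderDate, SalesAmount, Quantity
    FROM dbo.WideSalesTable;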