How to prepare a BigQuery database to best test different BI systems? - sql

We have a database in BigQuery that we want to prepare for testing different BI systems (Looker, Chartio, BIME, ...).
How should we organize the database?
Flatten and normalize
As a first step we have created normalized and flattened views that we intend to use for BI.
Combine tables into a huge view
As a second step we have considered creating one huge view that cross joins all the normalized views.
Test BI systems
We intend to initially use this huge view as data source for our tests of BI systems.
Split huge view to improve performance
When we have selected a BI system we intend to replace the huge view with smaller views, one per dashboard we end up building, in order to improve speed.
Does this sound like a good approach? Would a different approach be better?

For BigQuery it makes sense to denormalize the data. Joins are (relatively) expensive. BigQuery is very fast when querying a single table, even if it has >200 denormalized columns. Most of these tools will generate queries on the fly, so if you query a view all the time it will have to do lots of joins for each visualization, and you'll end up waiting a long time.
If you have to work with this normalized data, please do all the joins once and "materialize" it by creating a permanent table out of it, and build all your visualizations on top of that.
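A minimal sketch of that materialization step in BigQuery standard SQL (the dataset, table and column names below are invented for illustration, not taken from the question):
-- Hypothetical example: do the joins once and persist the result,
-- so the BI tool queries a single wide table instead of re-running the joins.
CREATE OR REPLACE TABLE reporting.orders_wide AS
SELECT
  o.order_id,
  o.order_date,
  c.customer_name,
  c.customer_country,
  p.product_name,
  p.product_category,
  o.quantity,
  o.quantity * p.unit_price AS revenue
FROM analytics.orders    AS o
JOIN analytics.customers AS c ON c.customer_id = o.customer_id
JOIN analytics.products  AS p ON p.product_id  = o.product_id;
The BI tool then points at reporting.orders_wide, and every visualization becomes a single-table scan.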

Related

Question on best practice for creating views that are consumed by visualization tools like PowerBI or Tableau

I've tried searching around to see what the best practices are when designing a view that will be consumed directly by PowerBI or Tableau for visualization.
I don't know the best way to ask this, but is there an issue with creating a big query with 30+ columns and multiple joins in the DB for export into the visualization platform? I've seen some posts regarding size and about breaking things up into multiple queries, but those were in reference to bringing the data into some program and writing logic in that program to do the joins, etc.
I have tried both ways so far: smaller views for which I then create relationships in PowerBI, or larger views where I'm dealing with just one flat table. I realize that in most respects PowerBI can handle a star schema with the data being brought in, but I've also run into weird filtering issues within PowerBI itself that I've been able to alleviate, and speed things up, by doing that work in the DB instead.
Database is a Snowflake warehouse.
Wherever possible, you should be using the underlying database to do the work that databases are good at, i.e. selecting/filtering/aggregating data. So your BI tools should query those tables rather than bringing all the data into the BI tool as one big dataset and then letting the BI tool process it.
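For example, a hypothetical Snowflake view that pushes the filtering and aggregation down to the warehouse (all names are made up for illustration):
-- The database does the heavy lifting; PowerBI/Tableau only reads the
-- pre-aggregated result instead of every raw order row.
CREATE OR REPLACE VIEW reporting.v_monthly_sales AS
SELECT
    DATE_TRUNC('month', o.order_date) AS order_month,
    r.region_name,
    SUM(o.amount)                     AS total_sales,
    COUNT(DISTINCT o.customer_id)     AS distinct_customers
FROM sales.orders o
JOIN sales.regions r
  ON r.region_id = o.region_id
WHERE o.status = 'COMPLETE'
GROUP BY 1, 2;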

Where does SQL stop and data modelling in Power BI start?

I am creating a dataset In Power BI Desktop using data in a SQL Server database. I have read the sqlbi article that recommends using database views but I would like to know: how do I structure them?
There is already a view of the database that contains most of the information I need. This view consists of 7 or 8 other, more basic views (mainly 2-column tables with keys and values), combined using left joins. Do I import the larger view as a flat table, or import each of the smaller views and create relationships between them, ideally in a star schema, in Power BI?
I guess conceptually I am asking: where does the SQL stop and Power BI start when it comes to creating and importing views?
where does the SQL stop and Power BI start when it comes to creating and importing views?
Great question. No simple answer. Typically modeling in Power BI is quicker and easier than modeling in the database. But modeling in the database enables DirectQuery, and is more useful outside of Power BI.
Sometimes it boils down to who is creating the model. The "data warehouse" team will tend to create models in the database first, either with views or tables. Analysts and end-users tend to create models directly in Power BI.
Sometimes it boils down to whether the model is intended to be used in multiple reports or not.
There is no one-size-fits-all approach here.
If your larger view already has what you need and you only need it for a one-off report, then you can modify it to add additional fields (data points), weighing that against the effort needed to create a proper schema.
Whether you should import the smaller views and connect them as a star schema (assuming they form a fact table surrounded by dimension tables) depends on whether you are going to use that model in a lot of other reports where the data is connected, i.e. giving you the same level of information in every report.
Creating views also depends on a lot of other factors, for example whether you are querying a reporting snapshot (or read replica) of your prod database or the actual production database. This might restrict you or affect the choice between views and materialized views.
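As a rough sketch of what modeling in the database with views can look like (purely hypothetical table and column names), you expose one view per dimension plus a fact view and import those into Power BI as a star schema:
-- Hypothetical dimension view: one row per customer.
CREATE VIEW dbo.v_dim_customer AS
SELECT c.customer_id, c.customer_name, c.segment, c.country
FROM dbo.customers c;
GO
-- Hypothetical fact view: one row per order line, keyed to the dimensions.
CREATE VIEW dbo.v_fact_sales AS
SELECT s.order_id, s.order_date, s.customer_id, s.product_id,
       s.quantity, s.net_amount
FROM dbo.sales s;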

ETL - Views or persist tables?

When building a Data Warehouse I usually see two main approaches for the ETL process:
1. View - View of views - View of views of views - ...
Approach one is obviously in the database and has the advantage that you don't have that much redundant data, but could lead to performance issues.
2. Stage table (copy of data) - clear table (copy of data) - dwh table (copy of data) - ...
Approach two can be done with many tools, such as stored procedures and jobs, or an ETL tool like SSIS.
The advantage here is that it's easy to understand the process, as you can visualise it pretty well. You usually also get very good overall ETL performance and many predefined tasks, etc.
A drawback, for example, is that changing the process is more complex, as persisted tables have to be changed.
In the real world you usually see a mix of both, especially when many people have worked on the process.
Of course it also depends on the situation (size of tables, how are similar processes designed in this company, how complex is the ETL-process, ...).
I personally prefer to copy tables, keep the ETL process simple and, if possible, do everything in the ETL tool (usually SSIS in my case), which is designed for this purpose.
But what is best practice and why?
Views and views of views would not scale with the volume of data in a DWH. When it comes to a DWH, we are talking about huge volumes of data. Integration of data from multiple sources is a common use case for a DWH. Stage -> transform -> fact/dim is one of the most common ways a DWH is built to store data. Yes, this changes somewhat when we talk about HDFS and other technologies, but views would not be able to give you the desired performance in a DWH.
I have seen many systems, and all of them have a multi-step ETL process where you first get data into the DWH from the sources and then clean/process/conform/transform this data via ETL (or other tools) into your dimensional or other model.
If you want to know point-in-time reference data relationships, implemented in a dimensional DW as type-2 or type-3 slowly changing dimensions, you probably won't find this in a source system.
The scale issue mentioned by garpitmzn is not just about data volumes, but also the joins necessary to restructure and denormalise the data for dimensional analysis. Using views (unless materialised) you'd repeat potentially complex joins for every query. Better to do it once at the time the dimension is loaded.
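A minimal sketch of "do it once at load time" (staging and dimension names are invented): the complex join runs once per load and the result is persisted, instead of being re-run by a view on every query.
-- Hypothetical stage -> transform -> persisted dimension step.
-- A real dimension load would also generate surrogate keys and handle
-- slowly changing dimensions; this only shows where the join happens.
INSERT INTO dw.dim_product (product_key, product_name, brand_name, category_name, load_date)
SELECT p.product_id,
       p.product_name,
       b.brand_name,
       c.category_name,
       CURRENT_TIMESTAMP
FROM stage.products   p
JOIN stage.brands     b ON b.brand_id    = p.brand_id
JOIN stage.categories c ON c.category_id = p.category_id;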

SSAS cube from a flat table

I'm trying to figure out if one can build an SSAS cube quickly for prototyping from just one huge, wide table without doing any ETL or custom SQL. Is it even possible?
What we are trying to do: we have a bunch of these tables for different subject areas which were denormalized, and a lot of effort was put into creating and testing them. We need a quick way to access this data now and run analytical queries, but before we spend time on ETL/dimensional design, we want to build a quick cube.
Please do not suggest PowerPivot or any other in-memory tools - these tables are really big and we have very limited RAM at our disposal.
Yes, it's possible. Simply use the same table for creating both dimensions and cubes (measure groups). It's not ideal to do it like this for production, but you should be fine for prototyping.
Another alternative I always use in situations like this: create SQL views on top of the wide table to mimic the dimensions and facts (a dimensional model), and use those views in the data source view. If you have time to spend on creating the views, this is the best method, because at the end of the prototype you know the model and functionality are working, and you just need to create the physical data warehouse and ETL when you're ready to implement in production.
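For example (table and column names invented for illustration), the views over the wide table could carve out dimensions with DISTINCT and reuse the original grain as the fact:
-- Hypothetical dimension view carved out of the wide table.
CREATE VIEW dbo.DimCustomer AS
SELECT DISTINCT customer_id, customer_name, customer_region
FROM dbo.wide_sales;
GO
-- Hypothetical fact view: keep the keys and the measures at the original grain.
CREATE VIEW dbo.FactSales AS
SELECT customer_id, product_id, order_date, sales_amount, quantity
FROM dbo.wide_sales;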

Operational database schema to data mart schema, table reduction?

I'm starting to study SQL Server Analysis Services and I'm working my way through the training book, as well as the Developer Training Kit. In both, I find suggestions that the number of tables used in an OLAP database (ideally, star schema) is greatly reduced from the production OLTP database.
From the training kit:
We followed the data dimensional methodology to architect the data mart schema. From some 200 tables in the operational database, the data mart schema contained about 10 dimension tables and 2 fact tables.
From what I understand, the operational databases are usually (somewhat) normalised and the data mart schemas are heavily denormalised. I also believe that denormalising data usually involves adding more tables, not less.
I can't see how you can go from 200 tables to 12, unless you only need to report on a subset of data. And if you do only need to report on a subset of data, why can't you just use the appropriate tables in the operational database (unless there are significant performance gains to be made by using a denormalised star schema)?
Denormalizing is exactly the opposite of normalizing a database. In a normalized database everything is split apart into different tables to support concurrent writes to the data. This also has the side effect of storing any given piece of data exactly once (in an ideal 3rd normal form data structure). A drawback of normalizing is that reads take a lot longer, because the data is scattered and we need to join tables to make sense of it again (joins are pretty expensive operations).
When we denormalize, we take the data from multiple tables and merge it into one table. So now we have repeating data in these tables. The repeating data is useful because we no longer have to join to any other table to get it. Writing to such a data store is normally a bad idea, because changing one piece of data would mean a lot of writes to update all of its copies in the table, whereas it would only take one write in a normalized database.
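A toy illustration of the difference, with invented tables:
-- Normalized: each attribute is stored once, so reads need joins.
SELECT o.order_id, o.amount, c.customer_name, co.country_name
FROM orders o
JOIN customers c  ON c.customer_id = o.customer_id
JOIN countries co ON co.country_id = c.country_id;
-- Denormalized: the same attributes are repeated on every row,
-- so the read is a single-table scan with no joins.
SELECT order_id, amount, customer_name, country_name
FROM sales_denormalized;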
OLTP stands for Online Transactional Processing; notice the word Transactional. Transactions are write operations, and the OLTP model is optimized for this. OLAP stands for Online Analytical Processing, Analysis being the keyword, meaning lots of reads.
Going from 200 tables to 12 in an OLTP-to-OLAP process will, surprisingly, retain nearly all of the data in the OLTP database, plus more. The OLTP database typically cannot record all of the changes over time, but OLAP specializes in this, so you get all of your historical data as well as the current data.
The star schema is probably the most common for OLAP data stores; the snowflake schema is also pretty common. You should learn about both and how to properly use them. It's just another great tool in your arsenal.
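To make the distinction concrete (hypothetical tables): a star schema flattens the attributes into one dimension table, while a snowflake schema keeps the dimension normalized into sub-tables.
-- Star schema: product attributes flattened into a single dimension table.
CREATE TABLE dim_product (
    product_key   INT PRIMARY KEY,
    product_name  VARCHAR(100),
    category_name VARCHAR(100),
    brand_name    VARCHAR(100)
);
-- Snowflake schema: the category is normalized out into its own table.
CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(100)
);
CREATE TABLE dim_product_sf (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category (category_key)
);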
These two books from IBM will answer your questions much more thoroughly, and they are free PDFs.
http://www.redbooks.ibm.com/abstracts/sg247138.html
http://www.redbooks.ibm.com/abstracts/sg242238.html