I'm currently working on top of a Tabular SSAS database with transactional data. The data model contains a lot of tables (with hundreds of millions of rows) and relationships.
This database is connected to many PowerBI dashboards, and sadly performance of visualizations has dropped critically.
Currently, the solution has partitions and paging. Moreover, its storage space is 60% full.
Because many of the visualizations in PowerBI have many filters based on columns, I was thinking about creating some indexes based on these columns.
Is this a viable solution? Is there any other way to optimize performance?
I've tried searching around to see what the best practices are when designing a view that will be used for visualization going directly into PowerBI or Tableau.
I don't know the best way to ask this, but is there an issue with creating one big query (30+ columns, multiple joins) in the DB for export into the visualization platform? I've seen some posts about size and about breaking the work up into multiple queries, but those refer to pulling the data into a program and writing the join logic there.
I have tried both ways so far: smaller views that I then relate to each other in PowerBI, and larger views where I'm dealing with just one flat table. I realize that in most respects PowerBI can build a star schema from the data being brought in, but I've also run into weird filtering issues within PowerBI itself that I have been able to alleviate, and speed up, by doing that work in the DB instead.
Database is a Snowflake warehouse.
Wherever possible, you should use the underlying database to do the work that databases are good at, i.e. selecting, filtering, and aggregating data. So your BI tools should query those tables rather than bring all the data into the BI tool as one big dataset and then let the BI tool process it.
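To make the difference concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for the warehouse, and the `sales` table, its columns, and both function names are made up for illustration. The first function lets the database filter and aggregate; the second pulls every row and does the same work on the client, which is roughly what happens when a BI tool imports one big dataset.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10.0), ("north", 5.0), ("south", 7.5), ("south", 2.5), ("east", 4.0)],
)


def totals_in_database(conn, min_total):
    """Filter and aggregate inside the database; only summary rows come back."""
    query = """
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
        HAVING SUM(amount) >= ?
        ORDER BY region
    """
    return conn.execute(query, (min_total,)).fetchall()


def totals_in_client(conn, min_total):
    """Fetch every row and aggregate in the client instead."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return sorted((r, t) for r, t in totals.items() if t >= min_total)


db_result = totals_in_database(conn, 10.0)
client_result = totals_in_client(conn, 10.0)
print(db_result)
```

Both functions return the same answer, but the first one moves only the summary rows out of the database, while the second moves every detail row before doing any work.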
We have a database in BigQuery that we want to prepare for testing different BI systems (Looker, Chartio, BIME, ...).
How should we organize the database?
Flatten and normalize
As a first step, we have created normalized and flattened views that we intend to use for BI.
Combine tables into a huge view
As a second step, we have considered creating a huge view that cross joins all the normalized views.
Test BI systems
We intend to initially use this huge view as data source for our tests of BI systems.
Split huge view to improve performance
When we have selected a BI system we intend to create smaller views instead of the huge view to support each dashboard we end up building, to improve speed.
Does this sound like a good approach? Would a different approach be better?
For BigQuery it makes sense to denormalize the data. Joins are (relatively) expensive. BigQuery is very fast when querying a single table, even if it has >200 denormalized columns. Most of these tools will generate queries on the fly, so if you query a view all the time it will have to do lots of joins for each visualization, and you'll end up waiting a long time.
If you have to work with this normalized data, please do all the joins once and "materialize" it by creating a permanent table out of it, and build all your visualizations on top of that.
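As a rough illustration of "join once, then materialize", here is a sketch that uses in-memory SQLite in place of BigQuery. The `customers` and `orders` tables and the `orders_wide` names are assumptions made up for the example. A view repeats the join every time it is queried, while a table created from the same SELECT stores the joined rows once.

```python
import sqlite3

# SQLite stands in for BigQuery here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'SE'), (2, 'DE');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 8.0);

    -- A view re-runs the join on every query that touches it.
    CREATE VIEW orders_wide_view AS
        SELECT o.order_id, o.amount, c.country
        FROM orders AS o JOIN customers AS c USING (customer_id);

    -- A table built from the same SELECT stores the joined rows once.
    CREATE TABLE orders_wide AS SELECT * FROM orders_wide_view;
""")

# Dashboards would query the flat table, so no join runs at query time.
by_country = conn.execute(
    "SELECT country, SUM(amount) FROM orders_wide GROUP BY country ORDER BY country"
).fetchall()
print(by_country)
```

The trade-off is that the materialized table is a snapshot: it has to be rebuilt or appended to whenever the source tables change.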
I'm new to Analysis Services
My first cube has been deployed and it seems to work.
Dimension tables are ok and fact tables are ok.
My question is very simple: if I add a new record to the related data source table, I don't see it when browsing the cube until I process the cube again.
In my mind, if new records are added, the cube should reflect the changes.
How do I solve this issue? Do I need to reprocess the cube every time a new record is added? This is impossible, of course.
You understand that essentially your cube represents a bunch of aggregated measures? That means that when the cube is processed it looks at all the data that is in your fact tables and processes the Measures (according to the dimensions).
The result of this is that you're able to access the data in the cube quickly and efficiently. The downside, as you have mentioned, is that when new data is added to the fact table the cube isn't updated.
Typically there will be a daily batch job that updates the cube with the latest fact data. Depending on the amount of data you have and your "real-time" requirements, this could be done more than once per day. A lot of people do this out of hours.
If you look closely in BIDS you will notice on the Partitions tab that for each partition it has a Storage Mode which you can define.
I would recommend you read this article: http://sqlblog.com/blogs/jorg_klein/archive/2008/03/27/ssas-molap-rolap-and-holap-storage-types.aspx
Basically, there are a few different modes you can use:
MOLAP (Multi dimensional Online Analytical Processing)
MOLAP is the most used storage type. It's designed to offer maximum query performance to the users. Data AND aggregations are stored in optimized format in the cube. The data inside the cube will refresh only when the cube is processed, so latency is high.
ROLAP (Relational Online Analytical Processing)
ROLAP does not have the high latency disadvantage of MOLAP. With ROLAP, the data and aggregations are stored in relational format. This means that there will be zero latency between the relational source database and the cube.
The disadvantage of this mode is performance: this type gives the poorest query performance because no objects benefit from multidimensional storage.
HOLAP (Hybrid Online Analytical Processing)
HOLAP is a storage type between MOLAP and ROLAP. Data will be stored in relational format(ROLAP), so there will also be zero latency with this storage type.
Aggregations, on the other hand, are stored in multidimensional format (MOLAP) in the cube to give better query performance. SSAS will listen to notifications from the source relational database. When changes are made, SSAS will get a notification and will process the aggregations again.
With this mode it’s possible to offer zero latency to the users, with query performance that falls between that of MOLAP and ROLAP.
To get the real-time reporting without having to reprocess your cube you will need to try out ROLAP, but beware, the performance will suffer (depending on the size of your cube and server!).
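The latency difference between the two modes can be shown with a toy model. This is only a sketch of the idea and not how SSAS is implemented: the `ToyCube` class, its methods, and the sample rows are invented for illustration. In "MOLAP" mode it answers from aggregates stored at the last processing run; in "ROLAP" mode it aggregates from the source rows on every query.

```python
from collections import defaultdict


class ToyCube:
    """Toy model of the two storage modes; not how SSAS is implemented."""

    def __init__(self, fact_rows, mode):
        self.fact_rows = fact_rows  # shared reference to the "relational source"
        self.mode = mode            # "MOLAP" or "ROLAP"
        self._aggregates = {}
        self.process()

    def _aggregate(self):
        totals = defaultdict(float)
        for product, amount in self.fact_rows:
            totals[product] += amount
        return dict(totals)

    def process(self):
        """Rebuild the stored aggregates from the current fact rows."""
        self._aggregates = self._aggregate()

    def total(self, product):
        if self.mode == "ROLAP":
            # Aggregate from the source on every query: always current, but slower.
            return self._aggregate().get(product, 0.0)
        # MOLAP: answer from the snapshot taken at the last process() call.
        return self._aggregates.get(product, 0.0)


facts = [("bike", 100.0), ("bike", 50.0)]
molap = ToyCube(facts, "MOLAP")
rolap = ToyCube(facts, "ROLAP")

facts.append(("bike", 25.0))     # a new record arrives in the fact table
stale = molap.total("bike")      # still the total from the last processing run
live = rolap.total("bike")       # sees the new row immediately
molap.process()                  # the scheduled batch job
refreshed = molap.total("bike")  # current again after reprocessing
print(stale, live, refreshed)
```

The MOLAP-style cube stays at the old total until `process()` runs, which mirrors the behaviour described in the question, while the ROLAP-style cube pays the aggregation cost on every query.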
I am building a data warehouse. I need to get data from different sources and put it together so that I can generate reports. I will do lots of joining of tables. I am talking about maybe 20 tables total, and each table is going to be anywhere from 100 MB to 5 GB.
I would like to know if I should be creating different databases for each table since each table might have an entirely different TYPE of dataset.
For example, I might have one table that has 1 GB of data about design of cars. And I will have another table with 3 GBs of sales data on these cars.
Would it be appropriate to separate these into different databases?
Please let me know what additional information is needed to advise me on this situation.
If there's a logical or business separation, by all means put them in different databases. That's just clean data application development. However, if you're going to be joining or merging the different data sets, then you can save some overhead and admin costs by having a single database. 20 tables total isn't a lot (I'm working on a system that has about 3700 tables, though ~1600 are audits). Keep in mind SQL Server is meant to scale up to terabytes of data, provided you have a decent model, indexes, etc.
If you're concerned with performance of the warehouse, you can jam that server full of RAM and hard drives. To use the hard drives properly you'd want to look at multiple files / filegroups and dole the tables out appropriately.
Splitting into different databases would normally be done to spread I/O load. In SQL Server you can have different filegroups within the database itself if you want to spread I/O across multiple disk groups or disks. In warehousing scenarios you often deal with SAN solutions for database storage; depending on your scenario, some of these won't care performance-wise one way or the other, while others might give you additional performance if planned properly.
You can also look at table partitioning for your growing database, but in my opinion you should first make sure you have plenty of good old memory. It will benefit you more than spending time and effort worrying about databases and files.
We are running 100 GB databases in a single database file and the performance is stellar. Much of the frequently accessed data resides in memory, though; with decent table structure and logical indexes you'll have a responsive warehouse in no time.
If you are planning on having foreign key relationships between these tables (and it sounds like you are), then I would keep it all in one database. Typically I use separate databases for totally separate bodies of data.
If you do separate them then you will run into some interesting challenges when you try to query both at the same time.
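Here is a small sketch of what querying across databases looks like. It uses SQLite's `ATTACH DATABASE` as a stand-in, which is an assumption for illustration only; SQL Server uses its own cross-database naming instead, and the `cars` and `sales` tables are made up. The point is that every table reference now has to carry its database name, and constraints such as foreign keys generally stop at the database boundary.

```python
import sqlite3

# SQLite stand-in: "main" plays the design database, "sales_db" a second database.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS sales_db")

conn.execute("CREATE TABLE main.cars (car_id INTEGER PRIMARY KEY, model TEXT)")
conn.execute("CREATE TABLE sales_db.sales (car_id INTEGER, units INTEGER)")
conn.executemany("INSERT INTO main.cars VALUES (?, ?)", [(1, "hatchback"), (2, "sedan")])
conn.executemany("INSERT INTO sales_db.sales VALUES (?, ?)", [(1, 30), (2, 12), (1, 8)])

# The join works, but only with database-qualified names on both sides.
units_by_model = conn.execute("""
    SELECT c.model, SUM(s.units)
    FROM main.cars AS c
    JOIN sales_db.sales AS s ON s.car_id = c.car_id
    GROUP BY c.model
    ORDER BY c.model
""").fetchall()
print(units_by_model)
```

With everything in one database the same query needs no qualification, and the relationship between `cars` and `sales` can be enforced with an ordinary foreign key.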
I'm starting to study SQL Server Analysis Services and I'm working my way through the training book, as well as the Developer Training Kit. In both, I find suggestions that the number of tables used in an OLAP database (ideally, star schema) is greatly reduced from the production OLTP database.
From the training kit:
We followed the data dimensional methodology to architect the data mart schema. From some 200 tables in the operational database, the data mart schema contained about 10 dimension tables and 2 fact tables.
From what I understand, the operational databases are usually (somewhat) normalised and the data mart schemas are heavily denormalised. I also believe that denormalising data usually involves adding more tables, not less.
I can't see how you can go from 200 tables to 12, unless you only need to report on a subset of data. And if you do only need to report on a subset of data, why can't you just use the appropriate tables in the operational database (unless there are significant performance gains to be made by using a denormalised star schema)?
Denormalizing is exactly the opposite of normalizing a database. In a normalized database everything is split apart into different tables to support concurrent writes to the data. This also has the side effect of storing any given piece of data exactly once (in an ideal third normal form data structure). A drawback of normalizing is that reads take a lot longer, because the data is scattered and we need to join tables to make sense of it again (joins are pretty expensive operations).
When we denormalize, we take the data from multiple tables and merge it into one table. So now we have repeating data in these tables. The repeating data is useful because we no longer have to join to any other table to get it. Writing to a denormalized data store is normally a bad idea, because a single change can mean a lot of writes to update every repeated copy in a table, whereas it would take only one write in a normalized database.
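The write cost of repeated data is easy to demonstrate. The following is a minimal sketch using in-memory SQLite, with hypothetical `customers`, `orders`, and `orders_flat` tables: the same change touches one row in the normalized model and every repeated copy in the denormalized one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized: the customer's city is stored once.
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Oslo');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 1, 8.0);

    -- Denormalized: the city is repeated on every order row, so reads need no join.
    CREATE TABLE orders_flat AS
        SELECT o.order_id, o.amount, c.city
        FROM orders AS o JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# The customer moves: one row changes in the normalized model...
normalized_writes = conn.execute(
    "UPDATE customers SET city = 'Bergen' WHERE customer_id = 1"
).rowcount

# ...but every repeated copy has to change in the flat table.
denormalized_writes = conn.execute(
    "UPDATE orders_flat SET city = 'Bergen' WHERE city = 'Oslo'"
).rowcount
print(normalized_writes, denormalized_writes)
```

With three orders the difference is one write versus three; with millions of fact rows the gap grows accordingly, which is why denormalized stores are usually loaded in batches rather than updated row by row.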
OLTP stands for Online Transactional Processing; notice the word "Transactional". Transactions are write operations, and the OLTP model is optimized for them. OLAP stands for Online Analytical Processing; "Analytical" is the keyword, meaning lots of reads.
Surprisingly, the 12 tables you end up with after going from OLTP to OLAP will hold nearly all of the data in the OLTP database, plus more. The OLTP database typically does not record all of the changes over time, but OLAP specializes in this, so you get all of your historical data as well as current data.
The star schema is probably the most common for OLAP data stores; the snowflake schema is also pretty common. You should learn about both and how to use them properly. They are just more great tools in your arsenal.
These two books from IBM will answer your questions much more thoroughly, and they are free PDFs.
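For a feel of what a star schema looks like, here is a minimal sketch in SQLite. The `dim_date`, `dim_product`, and `fact_sales` tables and their contents are assumptions invented for the example. One fact table holds the measures and points at each dimension table through a key, so a typical report is the fact table joined to a few small dimensions and then grouped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables describe the "who/what/when" of each fact.
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);

    -- The fact table holds the measures plus one key per dimension.
    CREATE TABLE fact_sales (
        date_key INTEGER REFERENCES dim_date (date_key),
        product_key INTEGER REFERENCES dim_product (product_key),
        amount REAL
    );

    INSERT INTO dim_date VALUES (20230101, 2023), (20240101, 2024);
    INSERT INTO dim_product VALUES (1, 'bikes'), (2, 'helmets');
    INSERT INTO fact_sales VALUES
        (20230101, 1, 100.0), (20240101, 1, 150.0), (20240101, 2, 30.0);
""")

sales_by_year_and_category = conn.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d ON d.date_key = f.date_key
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
""").fetchall()
print(sales_by_year_and_category)
```

A snowflake schema would take this one step further by splitting a dimension into further tables, for example moving `category` out of `dim_product` into its own table.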