What are the prerequisites and practices for multidimensional cube design (during the analysis phase)? - SSAS

I've been assigned to design a multidimensional cube in SSAS.
I am very new to SSAS, and the project is currently in the analysis phase.
I just wanted to ask: is there any standard process or guideline I should follow, or any general questions I should prepare, prior to designing the cube?
One thing the client specifically mentioned is the volume of data:
One service area has 3 million rows, covering 3 years of data
Does that mean we should plan a partitioning strategy? If yes, what should I be looking at? One thing that comes to mind:
which field should we use to split the cube (am I heading in the right direction?)
What other factors should I consider during analysis?

SSAS design is a large topic with many different angles. If I were in your shoes, I'd google "SSAS design" or something along those lines to learn more. For example, here's a sample chapter from a book provided by Microsoft themselves: https://www.microsoftpressstore.com/articles/article.aspx?p=2812063
I'd skip partitioning at this stage. See how the cube performs first and tune it later if it's really necessary. Usually partitioning is done on some accumulating field, like a date, so that old data is not processed daily and only the latest data (partition) is updated (processed). This of course depends on the data you're dealing with.
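If you do reach that point, date-sliced partitions in practice come down to one source query per period, with only the current slice reprocessed regularly. A rough sketch in T-SQL, using made-up table and column names (FactService, DateKey) purely for illustration:

    -- Hypothetical partition source queries, one per year (table/column names are assumptions)
    -- Partition "FactService 2021": historical, processed once
    SELECT * FROM dbo.FactService WHERE DateKey BETWEEN 20210101 AND 20211231;

    -- Partition "FactService 2022": historical, processed once
    SELECT * FROM dbo.FactService WHERE DateKey BETWEEN 20220101 AND 20221231;

    -- Partition "FactService 2023": current slice, the only one processed on the regular schedule
    SELECT * FROM dbo.FactService WHERE DateKey BETWEEN 20230101 AND 20231231;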

Related

Data model guidance, database choice for aggregations on changing filter criteria

Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
It is an offline use case, meaning there are no inserts, deletes, or updates happening; instead, every X weeks a new complete dump is imported to replace the old data.
We would like an upper bound of 10 seconds to answer our queries. The faster the better; a maximum of 2 seconds per query would be great.
The actual data has around 100-200 million entries; the growth rate is linear.
The system has to serve a limited number of concurrent users, max 100.
Questions:
What would be the right database technology or mixture of technologies to solve our problem?
What would be an efficient data model for computing aggregations with changing filter criteria in several dimensions?
(Bonus) What would be the estimated hardware requirements given a specific technology?
What we tried so far:
Setting up a document store with denormalized entries. Problem: It doesn't perform well on general queries because it has to scan too many entries for aggregations.
Setting up a graph database with normalized entries. Problem: performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy to configure graphs and filters). BI tools are often used together with a data warehouse: the data is modeled, cleaned, and stored inside of the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However, many companies (particularly smaller companies) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside of a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would give you a variety of filtering & calculation options while still maintaining high performance and speed.
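To make the star-schema option concrete, here is a minimal sketch in SQL; every table and column name is made up for this Facebook-style example:

    -- Minimal star-schema sketch (all names are hypothetical)
    CREATE TABLE dim_person (
        person_key   INT PRIMARY KEY,
        country      VARCHAR(50),
        top_interest VARCHAR(50)
    );

    CREATE TABLE dim_date (
        date_key       INT PRIMARY KEY,
        calendar_date  DATE,
        calendar_year  INT,
        calendar_month INT
    );

    CREATE TABLE fact_interaction (
        date_key              INT REFERENCES dim_date (date_key),
        initiating_person_key INT REFERENCES dim_person (person_key),
        responding_person_key INT REFERENCES dim_person (person_key),
        topic                 VARCHAR(50),
        interaction_type      VARCHAR(50)
    );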
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the number of interactions, grouped and/or filtered by any of those columns.
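For instance, counting 2018 interactions per initiating country, filtered to responders whose top interest is ACDC, is a single aggregate query. This is just a sketch; the table name (interactions) and the column names are assumptions based on the layout above:

    -- Count 2018 interactions per initiating country, limited to responders whose top interest is ACDC
    -- (table and column names are assumptions based on the layout above)
    SELECT initiating_person_country,
           COUNT(*) AS interaction_count
    FROM   interactions
    WHERE  interaction_date >= '2018-01-01'
      AND  interaction_date <  '2019-01-01'
      AND  responding_person_top_interest = 'ACDC'
    GROUP  BY initiating_person_country
    ORDER  BY interaction_count DESC;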
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)

BI Data modeling - Traditional vs new approaches

Dear community,
I hope the headline gives you a hint of what I want to talk about / need advice on.
I'm a BI developer with 3 years of experience working on big BI projects - some in the health industry and some in the finance industry, from when I was working at IBM.
In my current job I've joined a startup; the company has an operational DB for the product, and the data lives in a SQL Server DB.
For 4 months I was putting out fires caused by the mess my predecessor left, and now I'm ready for the next step - modeling the operational DB tables into a DWH so the data can be extracted and used for analytical and BI purposes.
I don't have any resources at all, so I will build the DWH first on the operational DB; my vision is that the DWH will move to Snowflake once I get resources from my CTO.
The modeling issue:
When I'm tackling the issue of data modeling I run into some confusion about the right way to model data: there is the traditional way I'm familiar with from IBM, but there are also cloud DWH modeling and hybrid approaches.
My model needs to be flexible, and the data should be extractable very fast.
What is the best way to store and extract data for analytical purposes?
Fact tables with a lot of dimensions - the normalized approach
OR
Putting all the data I need at a given granularity into the same table (thinking about the future move to Snowflake) - I would have several tables, each one with its own granularity and its own world.
I'm just interested to hear what some of you have implemented at your companies, and whether you have any advice or use cases you can share. I searched the web a lot and what I found is a lot of biased and very confusing information - nobody really says what works in the real world.
Thanks in advance!
Well, two key goals of normalisation are to reduce the disk space used and to optimise data retrieval, neither of which is all that relevant in Snowflake. Storage is dirt cheap, and for the most part the database is self-optimised - worst case you might have to set up clustering keys on very large tables (see: https://docs.snowflake.net/manuals/user-guide/tables-clustering-keys.html)
I've found that big tables with lots of columns perform better than many smaller tables with joins. For example, when testing on a flat table with 10 million rows and a clustering key set up, it was about 180% faster to obtain the same result set than with a more complex, multi-table model.
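Defining a clustering key is a one-liner; here is a sketch with made-up table and column names (see the linked documentation for the details):

    -- Snowflake: add a clustering key to an existing large, flat reporting table
    -- (table and column names are hypothetical)
    ALTER TABLE analytics.events_flat CLUSTER BY (event_date);

    -- Or declare it at creation time
    CREATE TABLE analytics.events_flat (
        event_date  DATE,
        customer_id NUMBER,
        amount      NUMBER(12,2)
        -- ... many more denormalised columns
    )
    CLUSTER BY (event_date);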
If you're anticipating a lot of writeback and require object-level changes, then you should still consider normalisation - but you'd be better off with a star schema in that case.

Multidimensional Data Warehouse Alternative for Reporting

I'm a few months into developing a reporting solution. Currently I am loading a relational data warehouse (Fact and Dimension tables) using SSIS. SSAS cubes and dimensions are then created from the relational Data warehouse. I then use SSRS to build reports using MDX queries.
The problem I have is that things are starting to get rather complicated as I try to understand how multidimensional modelling works, along with MDX and cubes. Since the organization it's being designed for is rather small, I'm thinking that I should re-evaluate my approach.
I think maybe I should just eliminate SSAS from the picture and simply create reports that query the relational data warehouse directly with SQL. The relational data warehouse could still be loaded nightly to keep the data up to date for reporting.
I'm just wondering if that would be a good idea, considering I'm not very experienced with data warehousing and SSAS. I also wanted to know whether keeping my relational data warehouse in dimension and fact tables would still work with SQL queries, or whether I would need to redesign the tables. I don't want to make the decision to eliminate SSAS if that will end up causing more headaches or issues.
The reports will not include complicated calculations besides row counts and YTD percentages - for example, "How many callers were male?" and "How many callers called for Product A?", which are then broken down by month.
Any comments or suggestions are much appreciated, because I'm starting to feel rather frustrated trying to get SSAS cubes developed properly.
I was in a similar situation at my company. I had never used SSAS, and I was asked to do research on the benefits of using cubes to do some reporting. It was a pretty steep learning curve because my background is in development not data and reporting. SSAS is most useful when aggregate queries on a relational database are time consuming and if reports need to be broken down into hierarchies that an analyst can use to better understand the state of the business. Since SSAS stores aggregate info, queries of that nature are very quick. If your organization's data is small, the relational queries might be quick enough that you don't really need the benefit of storing aggregates.
Also, you need to take into consideration the maintainability of using SSAS. If you're having trouble figuring out SSAS and MDX, how easy a time will others have? I tried to explain an MDX query I wrote to my boss, who is experienced with SQL, but it's really quite different from relational queries. How easy is it going to be to add more complex reports?
A benefit of using SSAS is that it can put the analyst in control of the report. There are also great tools and support, and it's pretty easy to deploy and connect to.
Yes, you can remove SSAS from your architecture, because all the results you can get from an MDX query against SSAS you can also get from a T-SQL query against your data warehouse - the cube was built by reading data from the DW. BUT bear in mind the following: the main advantage of an OLAP cube, from my point of view, is aggregations.
Very simple explanation: let's say you have a fact table called orders with 1 million orders per month. If you want to know how much you sold in that month, using SQL you need to read row by row and sum the values to produce the total. That's about 1 million reads on your DB. If you have a cube with the proper aggregations configured, that value can be pre-calculated and pre-stored in your cube, so if you need to know how much you sold in a month, it takes only one read from your cube.
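In SQL terms, the monthly total is something like the query below, re-scanning the month's rows on every run, while the cube answers the same question from a pre-computed aggregation. The table and column names here are made up for illustration:

    -- Recomputed on every run by scanning roughly 1 million rows
    -- (table and column names are hypothetical)
    SELECT SUM(order_amount) AS total_sales
    FROM   dbo.FactOrders
    WHERE  order_date >= '2024-01-01'
      AND  order_date <  '2024-02-01';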
It's a matter of analysing your situation: if you have a small cube, maybe aggregations are not necessary and you can do fine with SQL, but depending on the situation they can be very helpful.

Thoughts on dimension measures for BI

I am working with a consultant who recommends creating a measure dimension and then adding the measure dimension key to our fact table.
I can see how this could make adding new measures easier, by just adding rows instead of physically creating columns in the fact table. I can also see how it adds work to the ETL process, adds another join to the star schema, requires one generic column in the fact table to hold all measure data, etc.
I'm interested in how others have dealt with this situation. We currently have close to twenty measures.
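To make it concrete, the two shapes being compared look roughly like this; all table and column names below are hypothetical:

    -- Conventional wide fact table: one column per measure
    CREATE TABLE fact_sales (
        date_key     INT,
        product_key  INT,
        sales_amount DECIMAL(18,2),
        quantity     INT,
        discount     DECIMAL(18,2)
    );

    -- Measure-dimension (EAV-style) alternative: one row per measure value
    CREATE TABLE dim_measure (
        measure_key  INT PRIMARY KEY,
        measure_name VARCHAR(50)    -- e.g. 'Sales Amount', 'Quantity', 'Discount'
    );

    CREATE TABLE fact_sales_eav (
        date_key      INT,
        product_key   INT,
        measure_key   INT REFERENCES dim_measure (measure_key),
        measure_value DECIMAL(18,2) -- one generic column holding every measure
    );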
Instinctively, I don't like it: it's the EAV model, which is not very popular (you can Google the reasons why).
The EAV model is generally considered to be a headache to query and maintain
Different measures go together with different dimensions; this approach could easily turn into "one giant fact table for everything" instead of multiple smaller fact tables for specific reporting areas
I suspect you would end up creating views to give the appearance of multiple fact tables anyway
You will multiply the number of rows in your fact table by the number of measures, resulting in a much bigger physical table
Even with a good indexing/partitioning scheme, queries that include more than one measure will have to read a lot more rows to get the data
What about measures with different data types?
Is this easily supported in your reporting tool?
I'm sure there are other issues, but those are the ones that come to mind immediately. As a rule of thumb, if someone suggests an EAV implementation in any context, you should be very wary and ask them exactly what advantages it offers and how it will be managed as the data and complexity increase. But I think you've already identified some key areas of concern.
SSAS will do this, and I know of a major vendor of insurance policy administration software that provided an M.I. solution for their system that works like this. You do get some flexibility from the approach, in that you can add measures without having to deploy a build of the cube, although for 20 measures I don't think you need to worry about that.
'Measures' is essentially another dimension (and often referred to as such in the documentation). I believe SSAS uses a largely column-oriented structure behind the scenes.
However, a naive application of this approach does have some issues that could come and bite you to a greater or lesser extent.
You only have one measure - [Value], [Amount] or whatever it's called. If your tool won't let you inject calculated measures at the front end, then you can't sort the whole data set on the value of one of your attribute types. ProClarity and Report Builder >= 2.0 will do this, but Excel won't.
You can't do ratios or other calculated measures in this way. You will have to either embed them in the cube script (meaning you need to deploy a build to add them) or use a tool that lets you define them in the client.
Although it doesn't make a lot of difference to the cube, it will be slow and fiddly to query at the database level, and it will increase storage requirements.

Database Normalisation and Searching it Quickly

I'm working on the technical architecture for a content solution integration. The data from the solution provider runs to millions of rows and is normalised to 3NF. It is updated on a regular schedule (most likely daily) and the data is split down to a very granular level of atomicity.
I need to search and query this data, and my current inclination is to leave the normalised data alone and create a denormalised database from it (going from the OLTP-style source to an OLAP-style store). The 'transfer' could be a custom-built program containing the necessary business logic in addition to the raw copying power, run on a set schedule as required. The denormalised database would then reduce the atomicity and allow the keyword searches and queries to run efficiently. I was looking at using Lucene.NET for the keyword work on the denormalised database.
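The transfer step I have in mind is essentially a set of flattening joins, along the lines of the sketch below; all schema, table, and column names are placeholders rather than the real model:

    -- Flatten the 3NF source into one searchable, reportable row per content item
    -- (all names are placeholders)
    INSERT INTO search.content_flat (content_id, title, body_text, author_name, category_name, published_date)
    SELECT c.content_id,
           c.title,
           c.body_text,
           a.display_name,
           cat.category_name,
           c.published_date
    FROM   src.content  AS c
    JOIN   src.author   AS a   ON a.author_id     = c.author_id
    JOIN   src.category AS cat ON cat.category_id = c.category_id;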
So before I sing loudly from the hills that this is the way forward, I wanted some expert opinion on this and on what the perceived "best practice" is. Is the method I have suggested the best way forward, considering the data I will be provided with? It was suggested that perhaps I could use a 'search engine' to search the normalised data. This scared the hell out of me, but raised the question: what search engine, and how?
Opinions, flames, bad language and help appreciated :)
I have built reporting databases and data warehouses based on data stored in normalized form. There is quite a bit of work involved in the transfer program (ETL). Given your description of the data feed, maybe some of that work has been done for you by the feeder.
Millions of rows isn't a lot, these days. You may be able to get away with report oriented views into the existing database. Try it and see.
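By "report oriented views" I mean something along these lines; the schema, table, and column names are made up for the sake of the example:

    -- A reporting view over the existing normalised schema (all names are hypothetical)
    CREATE VIEW reporting.v_content_by_category_month AS
    SELECT cat.category_name,
           YEAR(c.published_date)  AS published_year,
           MONTH(c.published_date) AS published_month,
           COUNT(*)                AS item_count
    FROM   src.content  AS c
    JOIN   src.category AS cat ON cat.category_id = c.category_id
    GROUP  BY cat.category_name, YEAR(c.published_date), MONTH(c.published_date);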
The biggest benefit to building an OLAP oriented database is not speed. It's flexibility. "We love this report, but now we want to see it weekly and quarterly instead of monthly. Bam! Done!" "Can you break it down by marketing category instead of manufacturing category? Bam! Done!" And so on.
A reasonably normalized model (3NF/BCNF) provides the best average performance and the fewest modification anomalies across the largest number of scenarios. That's big, so I would start from there. As your requirements are fuzzy, it seems like the most sensible option.
Actually, the most sensible thing would be to go over the requirements until they are a bit more "crisp" ;)
Also, if you could get your hands on a few early extracts from your data provider, you could experiment with them and get a feel for the data distributions (not all people live in one country, and some countries hold more people than others; not all people have children, and the number of children per person varies greatly by country). This is a major point, and it is crucial that the optimizer can make good decisions.
Other than that, I agree with everything Walter said and also gave him my vote.