I come from a relational SQL Server database background, and am trying to make the transition to a multi-dimensional model in Analysis Services.
I'm struggling with how to approach the following problem, which would be incredibly simple in the relational world.
I have 3 tables - Incident, IncidentOffender, and IncidentLoss. There may be none, one, or many IncidentOffenders and IncidentLosses for each Incident.
How can I design my data warehouse such that I will be able to ask the cube, for example, "how much time did we spend dealing with incidents on which a bald offender stole baked beans?", as well as "what was the value of those beans?"
Apologies if this sounds simple, but I've scoured the web and devoured various books, and still I cannot find a real-life example of anything like this, which seems like an everyday situation to me.
In your scenario, all three tables need to be loaded into SSAS as both dimensions and measure groups. The Incident Offender and Incident Loss dimensions can then be related to the Incident measure group as many-to-many dimensions, which is how they will appear in the Dimension Usage tab.
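For the relational side of that design, here is a minimal sketch of what the warehouse tables could look like (all table and column names are illustrative, not from the original post): the Incident grain carries the time-spent measure, while two intermediate fact tables resolve the many-to-many relationships to offenders and losses.

    -- Illustrative T-SQL sketch of the star design behind the cube.
    CREATE TABLE dbo.DimIncident (
        IncidentKey    INT         NOT NULL PRIMARY KEY,
        IncidentNumber VARCHAR(20) NOT NULL
    );

    CREATE TABLE dbo.DimOffender (
        OffenderKey     INT         NOT NULL PRIMARY KEY,
        HairDescription VARCHAR(20) NOT NULL   -- e.g. 'Bald'
    );

    CREATE TABLE dbo.DimLossItem (
        LossItemKey     INT         NOT NULL PRIMARY KEY,
        ItemDescription VARCHAR(50) NOT NULL   -- e.g. 'Baked beans'
    );

    -- Main measure group: one row per incident, carrying time spent.
    CREATE TABLE dbo.FactIncident (
        IncidentKey      INT NOT NULL REFERENCES dbo.DimIncident (IncidentKey),
        TimeSpentMinutes INT NOT NULL
    );

    -- Intermediate measure groups that back the many-to-many relationships.
    CREATE TABLE dbo.FactIncidentOffender (
        IncidentKey INT NOT NULL REFERENCES dbo.DimIncident (IncidentKey),
        OffenderKey INT NOT NULL REFERENCES dbo.DimOffender (OffenderKey)
    );

    CREATE TABLE dbo.FactIncidentLoss (
        IncidentKey INT NOT NULL REFERENCES dbo.DimIncident (IncidentKey),
        LossItemKey INT NOT NULL REFERENCES dbo.DimLossItem (LossItemKey),
        LossValue   MONEY NOT NULL
    );

With this shape, "time spent on incidents where a bald offender stole baked beans" slices FactIncident through the two many-to-many relationships, and "the value of those beans" comes straight from FactIncidentLoss.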
I've been assigned to design a multidimensional cube in SSAS.
I'm very new to SSAS, and the project is currently in the analysis phase.
Is there a standard process or guideline I should follow, or any general questions I should prepare, before designing the cube?
One thing the client specifically mentioned is the volume of data:
One service area has 3 million rows, covering 3 years of data.
Does that mean we should plan a partitioning strategy? If yes, what should I be looking at? One thing that comes to mind:
which field should we use to split the cube into partitions (am I heading in the right direction?)
What other factors should I consider during analysis?
SSAS design is a large topic with many different angles. If I were in your shoes, I'd google for "SSAS design" or something along those lines to learn more. For example, here's a sample chapter from a book provided by Microsoft themselves: https://www.microsoftpressstore.com/articles/article.aspx?p=2812063
I'd skip partitioning at this stage. See how the cube performs first and tune it later if really necessary. Partitioning is usually done on some accumulating field, like a date, so that old data is not processed daily and only the latest data (partition) is updated (processed). This of course depends on the data you're dealing with.
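To make the date-based approach concrete: SSAS partitions are typically bound to queries (or views) that each cover a slice of the fact table. A rough sketch, with made-up table and column names:

    -- Historical partition: processed once, then left alone.
    SELECT OrderDateKey, ProductKey, SalesAmount
    FROM dbo.FactSales
    WHERE OrderDateKey BETWEEN 20180101 AND 20181231;

    -- Current partition: the only one reprocessed on the daily schedule.
    SELECT OrderDateKey, ProductKey, SalesAmount
    FROM dbo.FactSales
    WHERE OrderDateKey >= 20210101;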
Problem: I am working with a SaaS company that provides monthly services. We are trying to create a data model to track customer-related metrics such as counts, signups, cancellations, and reactivations. I've done extensive research online, but the closest I've found is the accumulating snapshot with start/end dates, which doesn't make sense for a SaaS company where a customer can reactivate an account.
My initial thought is to create a factless fact table for customers; however, this table would also have keys to event dimension tables (e.g. DimSignupType, DimCancellationType, DimReactivationType) and boolean measures for isSignup, isCancellation, and isReactivation. This seems counterintuitive, because a factless fact table shouldn't have facts, but I need to track those, and multiple fact tables feel worse because I would have to join them together in the view.
Is there a better approach to this problem?
Edit based on feedback: The main goal is to create a dimensional model that is maintainable, but also one I can build a view on, together with other dimension tables, that allows less technical users to discover insights with tools like Tableau. At the end of the day I need to provide a large flat view with multiple measures and dimensions that allows for easy analytical discovery. Common questions may be: "How many signups do we have MTD for this customer type vs last MTD?", "How many cancellations did we have due to Non-Payment this month compared to last?", "How many reactivations from Non-Payment did we have this month compared to last?", and so on.
A lot of this metadata comes from dimension tables I would join to the factless fact table on their keys, but it still requires Signups, Cancellations, and Reactivations to be tracked as facts for reporting purposes. I don't know the best modelling approach for this that abides by traditional standards. It almost seems like a snapshot fact table that contains keys to dimension tables describing events to be aggregated; I just don't know what that would be called.
I feel the most flexible solution in terms of data management and ease of use would be a factless fact table modeled in a daily snapshot manner, with "facts" for signups, cancellations, and reactivations that link to their type dimensions.
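As a rough sketch of what that could look like (DimCancellationType and the flag names come from the question; everything else is illustrative): each daily row carries the customer's state plus event flags, and the type keys are populated only on the day an event happens.

    CREATE TABLE dbo.FactCustomerDailySnapshot (
        DateKey             INT NOT NULL,
        CustomerKey         INT NOT NULL,
        SignupTypeKey       INT NULL,   -- populated only on the signup date
        CancellationTypeKey INT NULL,   -- populated only on a cancellation date
        ReactivationTypeKey INT NULL,   -- populated only on a reactivation date
        IsActive            BIT NOT NULL,
        IsSignup            BIT NOT NULL DEFAULT 0,
        IsCancellation      BIT NOT NULL DEFAULT 0,
        IsReactivation      BIT NOT NULL DEFAULT 0
    );

    -- "How many cancellations due to Non-Payment this month?"
    SELECT COUNT(*)
    FROM dbo.FactCustomerDailySnapshot AS f
    JOIN dbo.DimCancellationType AS ct
      ON ct.CancellationTypeKey = f.CancellationTypeKey
    WHERE f.IsCancellation = 1
      AND ct.CancellationTypeName = 'Non-Payment'
      AND f.DateKey BETWEEN 20240601 AND 20240630;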
Problem:
We are looking for some guidance on what database to use and how to model our data to efficiently query for aggregated statistics as well as statistics related to a specific entity.
We have different underlying data but this example should showcase the fundamental problem:
Let's say you have data of Facebook friend requests and interactions over time. You now would like to answer questions like the following:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
The general problem is that we have a lot of changing filter criteria (country, topic, interests, time) on both the entities that we want to calculate statistics for and the relevant related entities to calculate these statistics on.
Non-Functional Requirements:
- It is an offline use case: there are no inserts, deletes, or updates; instead, every X weeks a new complete dump is imported to replace the old data.
- We would like an upper bound of 10 seconds to answer our queries. The faster the better; a maximum of 2 seconds per query would be great.
- The actual data has around 100-200 million entries, and the growth rate is linear.
- The system has to serve a limited number of concurrent users, at most 100.
Questions:
- What would be the right database technology, or mixture of technologies, to solve our problem?
- What would be an efficient data model for computing aggregations with changing filter criteria across several dimensions?
- (Bonus) What would be the estimated hardware requirements, given a specific technology?
What we tried so far:
- Setting up a document store with denormalized entries. Problem: it doesn't perform well on general queries because it has to scan too many entries for aggregations.
- Setting up a graph database with normalized entries. Problem: it performs even more poorly on aggregations.
You talk about which database to use, but it sounds like you need a data warehouse or business intelligence solution, not just a database.
The difference (in a nutshell) is that a data warehouse (DW) can support multiple reporting views, custom data models, and/or pre-aggregations which can allow you to do advanced analysis and detailed filtering. Data warehouses tend to hold a lot of data and are generally built to be very scalable and flexible (in terms of how the data will be used). For more details on the difference between a DW and database, check out this article.
A business intelligence (BI) tool is a "lighter" version of a data warehouse, where the goal is to answer specific data questions extremely rapidly and without heavy technical end-user knowledge. BI tools provide a lot of visualization functionality (easy-to-configure graphs and filters). BI tools are often used together with a data warehouse: the data is modeled, cleaned, and stored inside the warehouse, and the BI tool pulls the prepared data into specific visualizations or reports. However, many companies (particularly smaller ones) do use BI tools without a data warehouse.
Now, there's the question of which data warehouse and/or BI solution to use.
That's a whole topic of its own & well beyond the scope of what I write here, but here are a few popular tool names to help you get started: Tableau, PowerBI, Domo, Snowflake, Redshift, etc.
Lastly, there's the data modeling piece of it.
To summarize your requirements, you have "lots of changing filter criteria" and varied statistics that you'll need, for a variety of entities.
The data model inside a DW would often use a star, snowflake, or data vault schema. (There are plenty of articles online explaining those.) If you're using purely a BI tool, you can de-normalize the data into a combined dataset, which would give you a variety of filtering & calculation options while still maintaining high performance and speed.
Let's look at the example you gave:
Data of Facebook friend requests and interactions over time. You need to answer:
In 2018 which American had the most German friends that like ACDC?
Which are the friends that person X most interacted with on topic Y?
You want to filter/re-calculate the answers to those questions based on country, topic, interests, time.
One potential dataset can be structured like:
Date of Interaction | Initiating Person's Country | Responding Person's Country | Topic | Interaction Type | Initiating Person's Top Interest | Responding Person's Top Interest
This would allow you to easily count the number of interactions, grouped and/or filtered by any of those columns.
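For instance, counting 2018 interactions with German responders, grouped by the initiator's country, becomes a single aggregation over that flat structure (the identifier names below are just one possible mapping of those columns):

    SELECT InitiatingPersonCountry, COUNT(*) AS Interactions
    FROM dbo.Interactions
    WHERE DateOfInteraction >= '2018-01-01'
      AND DateOfInteraction <  '2019-01-01'
      AND RespondingPersonCountry = 'Germany'
    GROUP BY InitiatingPersonCountry
    ORDER BY Interactions DESC;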
As you can tell, this is just scratching the surface of a massive topic, but what you're asking is definitely do-able & hopefully this post will help you get started. There are plenty of consulting companies who would be happy to help, as well. (Disclaimer: I work for one of those consulting companies :)
We're starting to build an SSAS tabular model and wondering if most people have one model or multiple. If multiple, do you duplicate tables that are needed by each, or is there a way to share tables between models? I think I know the answer, but I'm hoping those with more experience can confirm what we've found...
From what I've researched I think...
- you can't share tables across models - any "common" tables would have to be duplicated in, and deployed with, each model, and each copy would take up memory
- we should create one model, use perspectives to organize the tables and make it easier to work with
- multiple models could be acceptable if there is little or no common data across models
thanks
You are correct, there is no way to share tables between models.
Perspectives can help.
The question of whether to have one model or more depends on the user audience. Who are the users? How analytically savvy are they? Will they have a reasonable understanding of the model structure?
One issue that affects my rather unsophisticated users is when a dimension does not relate to all fact tables. In this scenario, as is expected, measures on the fact table calculate identical values for every member of the unrelated dimension. For less knowledgeable users, this situation is confusing.
I agree with Ari's answer, and am posting this answer to explain my own experience.
We use a few large models for more sophisticated users; these are in-memory and processed once a week. We have agreed with the business that these models will not be available during processing, so we are able to process without holding a transaction open. That allows us to keep many more smaller models, because we do not need to keep 2x the size of our largest model available to the instance. We use perspectives to simplify the presentation and reduce the confusion caused by the multiple fact tables. Even with perspectives, the models are rather complex, and it takes some training to get users used to working with the different facts.
We also use smaller models, usually targeted at a specific audience/need. Many are processed daily, using transactional processing to ensure they are available to users as much as possible. Several dimensions are used across our small models, but we are able to filter them so that users do not see the full list of members. This reduces size and has been a huge benefit for my users, because they only see members that have a fact they are analyzing instead of every member associated with any fact.
We use views to ensure conformity between models when a dimension is used in more than one. In my opinion this is very important, as it is very confusing when the same dimension appears with slightly different attribute names.
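A sketch of that approach (object names invented for illustration): every model sources the dimension from the same view, and a model-specific variant can additionally filter out members with no facts, as described above.

    -- Shared view: all models load their Date dimension from here,
    -- so attribute names stay identical across models.
    CREATE VIEW dbo.vDimDate AS
    SELECT DateKey, [Date], CalendarYear, CalendarMonth
    FROM dbo.DimDate;
    GO

    -- Model-specific view: expose only dimension members that actually
    -- have facts in this model's fact table.
    CREATE VIEW dbo.vDimProduct_SalesModel AS
    SELECT p.ProductKey, p.ProductName
    FROM dbo.DimProduct AS p
    WHERE EXISTS (SELECT 1 FROM dbo.FactSales AS f
                  WHERE f.ProductKey = p.ProductKey);
    GO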
To sum up (pun intended)...
I like developing and working with large models. I think they answer more questions with less work.
Most users I have worked with prefer smaller, more concise models. Your server hardware/processing requirements may direct you to smaller models as well, even though some of the dimensions will be duplicated.
I'm a few months into developing a reporting solution. Currently I am loading a relational data warehouse (Fact and Dimension tables) using SSIS. SSAS cubes and dimensions are then created from the relational Data warehouse. I then use SSRS to build reports using MDX queries.
The problem I have is that things are starting to get rather complicated as I try to understand how multidimensional modelling works, along with MDX and cubes. Since the organization it's being designed for is rather small, I'm thinking that I should re-evaluate my approach.
I think maybe I should just eliminate SSAS from the picture and simply create reports that report directly off the relational data warehouse using SQL queries. The relational data warehouse could still be loaded nightly to allow up to date data for reporting.
I'm just wondering if that would be a good idea, considering I'm not very experienced with data warehousing and SSAS. I also wanted to know whether keeping my relational data warehouse in dimension and fact tables would still work with SQL queries, or whether I would need to redesign the tables. I don't want to make the decision to eliminate SSAS if that will end up causing more headaches or issues.
The reports will not include complicated calculations beyond row counts and YTD percentages: for example, "How many callers were male?" and "How many callers called for Product A?", each broken down by month.
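To make that concrete, a report like "male callers by month" would translate directly into a grouped T-SQL query against the fact and dimension tables (the names here are hypothetical):

    SELECT d.CalendarYear,
           d.CalendarMonth,
           COUNT(*) AS MaleCallers
    FROM dbo.FactCall AS f
    JOIN dbo.DimCaller AS c ON c.CallerKey = f.CallerKey
    JOIN dbo.DimDate   AS d ON d.DateKey   = f.CallDateKey
    WHERE c.Gender = 'M'
    GROUP BY d.CalendarYear, d.CalendarMonth
    ORDER BY d.CalendarYear, d.CalendarMonth;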
Any comments or suggestions are much appreciated, because I'm starting to feel rather frustrated trying to get SSAS cubes developed properly.
I was in a similar situation at my company. I had never used SSAS, and I was asked to do research on the benefits of using cubes to do some reporting. It was a pretty steep learning curve because my background is in development not data and reporting. SSAS is most useful when aggregate queries on a relational database are time consuming and if reports need to be broken down into hierarchies that an analyst can use to better understand the state of the business. Since SSAS stores aggregate info, queries of that nature are very quick. If your organization's data is small, the relational queries might be quick enough that you don't really need the benefit of storing aggregates.
Also, you need to take into consideration the maintainability of using SSAS. If you're having trouble figuring out SSAS and MDX, how easy a time will others have? I tried to explain an MDX query I wrote to my boss, who is experienced with SQL, but it's really quite different from relational queries. How easy is it going to be to add more complex reports?
There are other benefits to using SSAS as well. First, it can put the analyst in control of the report. Second, there are great tools and support. Finally, it's pretty easy to deploy and connect to.
Yes, you can remove SSAS from your architecture, because every result you can get from an MDX query against SSAS you can also get from a T-SQL query against your data warehouse; after all, the cube was built by reading data from the DW. BUT bear in mind the following: the main advantage of an OLAP cube, in my view, is aggregations.
A very simple explanation: let's say you have a fact table called Orders with 1 million orders per month. If you want to know how much you sold in a given month, using SQL you need to read row by row and sum the values to produce the total; that's roughly 1 million reads on your DB. If you have a cube with the proper aggregations configured, that value can be pre-calculated and pre-stored in the cube, so answering the same question takes only one read.
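A rough relational analogy to a cube aggregation, with invented names, would be a pre-computed summary table:

    -- Without aggregations: scan ~1 million rows for each question.
    SELECT SUM(SalesValue) AS MonthTotal
    FROM dbo.FactOrders
    WHERE MonthKey = 202406;

    -- With a pre-aggregated table (built once during the nightly load),
    -- the same question becomes a single-row read.
    SELECT MonthKey, SUM(SalesValue) AS MonthTotal
    INTO dbo.FactOrdersByMonth
    FROM dbo.FactOrders
    GROUP BY MonthKey;

    SELECT MonthTotal
    FROM dbo.FactOrdersByMonth
    WHERE MonthKey = 202406;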
It's a matter of analysing your situation: if you have a small cube, maybe aggregations are not necessary and you can do fine with SQL, but depending on the situation they can be very helpful.