I am currently having a discussion with coworkers about whether age is a dimension attribute, a measure, or both.
We are unable to come to an agreement on this, as age could be used as a category attribute while age could also be used as a measure to be averaged.
I am sure others have run into this, not only with age but with other fields that were on the border of attribute and measure.
How is this best handled?
Age or AgeGroup is normally its own dimension. So you would have an Age dimension holding ages from, say, 1 to 150, with an attribute for AgeGroup (e.g. 20-25).
When it comes to deciding between a dimension attribute and a measure, let your requirements decide. Sometimes you end up having the same value as both a dimension attribute and a measure in the fact table.
Read this design tip from Kimball which explains the situation with examples:
Modelling data as both dimension and fact
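As a rough sketch of the "both" case, assuming 5-year non-overlapping buckets and made-up fact rows (the names `age_dim`, `AgeKey` and `AgeMeasure` are illustrative, not from any particular tool): the age is stored once as a key into the Age dimension, so you can slice by AgeGroup, and once as a numeric column in the fact, so you can average it.

```python
from statistics import mean

def age_group(age, width=5):
    """Return a label such as '20-24' for the bucket containing age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Age dimension: one row per age, with AgeGroup as an attribute.
age_dim = {age: {"AgeKey": age, "Age": age, "AgeGroup": age_group(age)}
           for age in range(1, 151)}

# Hypothetical fact rows: the dimension key plus the raw age as a measure.
facts = [
    {"AgeKey": 22, "AgeMeasure": 22},
    {"AgeKey": 24, "AgeMeasure": 24},
    {"AgeKey": 41, "AgeMeasure": 41},
]

# Slice by the dimension attribute, aggregate the measure.
ages_by_group = {}
for row in facts:
    group = age_dim[row["AgeKey"]]["AgeGroup"]
    ages_by_group.setdefault(group, []).append(row["AgeMeasure"])

avg_age_by_group = {group: mean(ages) for group, ages in ages_by_group.items()}
```

Whether you need the measure column at all depends on whether anyone actually wants to average or sum the age.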
I have a piece of coursework that I do not understand. I tried emailing my tutor, but he has not responded and I have been waiting for about 2 months now. I am supposed to create a star/snowflake schema focusing on 2 fact tables.
The project must focus on the NHS. We are free to define the scope, so I decided to focus on COVID-19. I have created a star schema for one fact table, called "Deaths". My idea is for the data warehouse to show which areas have the highest death rate, so that the NHS knows where demand is highest and can manage the situation accordingly.
I was thinking of making the second fact table Infection/Infected, to show which areas have the highest infection rates. I think that would not work, because the dimensions for "Infected" would be different from the ones for "Deaths" (I am not sure whether they have to be the same).
Could you share your thoughts and recommendations?
Here is the assignment brief, and below the brief is my star schema design (which I think is wrong).
I don't see the need for two fact tables, one for recovered and one for death cases.
You can have a single FactDiagnosticAnalysis table containing:
TreatmentCenterSK
PatientSK
TreatmentSK
StaffSK
DiagnosticSK
DateSK
Result
InsertedDate: a technical column to capture when the record was inserted
The Result column will hold one of the values Infected, Not Infected, Recovered or Dead as of a specific date, since:
a patient will have many analyses until their recovery
a patient can turn out not to be infected after the analysis done on arrival
a patient will be recovered after many analyses
a patient can die after many analyses
Your model can be like below :
Actually, in this case your fact table is a factless fact table.
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts.
You create the measures for your reports/dashboards as views (if you are using SQL):
The area with the highest death rate
The number of medical centers reaching their maximum capacity
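A minimal sketch of the "measures as views" idea, using SQLite from Python with only a subset of the columns listed above. The `DimTreatmentCenter` table, its `Area` column, the view name and all the sample rows are assumptions made for illustration, and "death rate" is taken here to mean patients with a Dead result divided by patients with an Infected result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimTreatmentCenter (TreatmentCenterSK INTEGER PRIMARY KEY, Area TEXT);
CREATE TABLE FactDiagnosticAnalysis (
    TreatmentCenterSK INTEGER, PatientSK INTEGER, DateSK INTEGER, Result TEXT);

-- Made-up sample data: one row per analysis, no numeric measure stored.
INSERT INTO DimTreatmentCenter VALUES (1, 'North'), (2, 'South');
INSERT INTO FactDiagnosticAnalysis VALUES
    (1, 100, 20200401, 'Infected'),
    (1, 100, 20200410, 'Dead'),
    (1, 101, 20200402, 'Infected'),
    (2, 200, 20200401, 'Infected'),
    (2, 200, 20200415, 'Recovered'),
    (2, 201, 20200403, 'Not Infected');

-- The measures are derived by counting rows (here: distinct patients).
CREATE VIEW DeathRateByArea AS
SELECT c.Area,
       COUNT(DISTINCT CASE WHEN f.Result = 'Dead' THEN f.PatientSK END) AS Deaths,
       COUNT(DISTINCT CASE WHEN f.Result = 'Infected' THEN f.PatientSK END) AS Infected
FROM FactDiagnosticAnalysis f
JOIN DimTreatmentCenter c ON c.TreatmentCenterSK = f.TreatmentCenterSK
GROUP BY c.Area;
""")

# Areas ordered by death rate, highest first.
rows = conn.execute(
    "SELECT Area, Deaths, Infected, 1.0 * Deaths / Infected AS DeathRate "
    "FROM DeathRateByArea ORDER BY DeathRate DESC"
).fetchall()
```

The first row of `rows` is the area with the highest death rate under that definition; a different definition of the rate only changes the view, not the fact table.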
I’ve been asked to create our analysis cube and have a design question.
We sell ‘widgets’ and ‘parts’ to go with those widgets. Each order has many widgets and sometimes a few parts.
What I’m stuck on is this: to me, an order is a fact in a measure. But what are the widgets? Are they a dimension, with each fact in the measure being an entry for every part and widget on the order?
So, if order 123 had widget 1 and widget 2 and part 5, then there will be 3 facts in the measure for the same order? Is that correct?
At its basic level you can consider most facts to be transactions or transaction line items. So, for example, you may have a 'sales' fact table in which each record represents one line item from that sale. Each fact record would have numeric columns representing metrics and other columns joining to dimension tables. The combination of those dimensions would describe that line item. So, in your case, you likely have something like:
1) A 'date' dimension detailing the date of the transaction
2) A 'widget' dimension detailing the widget sold on that transaction
3) A 'customer' dimension detailing the customer who bought that item (almost certainly the same customer would appear on every line item for this transaction)
4) ... determined by what information you have and what business problem you're trying to solve.
Now, the dimension tables contain further details. For example, your widget dimension table likely contains things like the name of the widget, the color, the manufacturer, etc. Every time your company sells one of these widgets, the record in the fact table links to that same dimension record for that name, color, manufacturer, etc. combination (i.e. you don't create a new dimension record every time you sell the same item - this is a one-to-many relationship - each dimension record may have many related fact records).
Your other dimension tables would similarly describe their dimensions. For example, the customer dimension might give the customer's name, their address, ...
So, the short answer to your question is that widget likely is a dimension, parts and widgets may (or may not) actually be the same dimension (in a school class I suspect that they are), and that you would have 3 fact records for that one transaction.
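To make the "3 fact records" point concrete, here is a small sketch assuming widgets and parts share one product dimension. The surrogate keys, date key, customer key and amounts are all invented for the example.

```python
from collections import defaultdict

# Product dimension: one row per distinct widget or part, not one per sale.
product_dim = {
    1: {"name": "widget 1", "kind": "widget"},
    2: {"name": "widget 2", "kind": "widget"},
    3: {"name": "part 5", "kind": "part"},
}

# Fact table at line-item grain: order 123 produces three rows that share
# the same order number, date and customer.
sales_fact = [
    {"order_no": 123, "date_key": 20240105, "customer_key": 7,
     "product_key": 1, "quantity": 1, "amount": 40.0},
    {"order_no": 123, "date_key": 20240105, "customer_key": 7,
     "product_key": 2, "quantity": 1, "amount": 60.0},
    {"order_no": 123, "date_key": 20240105, "customer_key": 7,
     "product_key": 3, "quantity": 2, "amount": 5.0},
]

# Slice the amount measure by an attribute of the product dimension.
amount_by_kind = defaultdict(float)
for row in sales_fact:
    kind = product_dim[row["product_key"]]["kind"]
    amount_by_kind[kind] += row["amount"]

rows_for_order_123 = sum(1 for row in sales_fact if row["order_no"] == 123)
```

If widgets and parts turn out to need very different attributes, the same fact rows could point at two separate dimensions instead.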
This is probably along the same lines as the prior answer but....
If you try to model "many widgets per order" you'll have issues, because you end up with a many (order facts) to many (widgets) relationship. In a cube / star schema design, many-to-many relationships usually need to be modelled out to be many-to-one in some way.
So what you do is try to identify what special thing identifies an "order" (as opposed to a bunch of widgets in an order). Usually that is simply stuff like order date, customer, order number and tax.
An example way to model this is:
If you have a single order with five widgets, you model that as a fact table with five records that happen to have a repeating widget, customer, date, etc. in them.
Then you have to work out how you spread an order header tax amount over five records. The two obvious solutions are:
Create a widget that represents tax and add that as another record
Spread the tax over five records, either evenly or weighted by something
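The second option can be sketched as follows, working in whole cents so the allocated amounts always add back up to the header tax. The line amounts are made up, and giving the leftover cents to the first lines is just one possible convention.

```python
def allocate(total_cents, weights):
    """Split total_cents across lines in proportion to weights, keeping the sum exact."""
    weight_sum = sum(weights)
    shares = [total_cents * w // weight_sum for w in weights]
    # Rounding down can lose a few cents; hand them out one per line from the top.
    for i in range(total_cents - sum(shares)):
        shares[i % len(shares)] += 1
    return shares

# Hypothetical order: five line amounts in cents and a header tax of 10.01.
line_amounts = [1000, 2000, 3000, 1500, 2500]
order_tax = 1001

even_split = allocate(order_tax, [1] * len(line_amounts))   # spread evenly
weighted_split = allocate(order_tax, line_amounts)          # weighted by line amount
```

Here `even_split` comes out as `[201, 200, 200, 200, 200]` and `weighted_split` as `[101, 200, 300, 150, 250]`; both sum to 1001, so totals by order still reconcile after the spread.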
Modelling "parts" just takes these concepts further.
It is important to understand what the end user wants to see and why they want to see parts. What do they want to measure by parts? How do you assign higher-level values (like tax) down to lower levels like parts?
I have a scenario where one salesperson is related to more than one department, and I need to calculate sales at both sales rep level and department level. Please share your thoughts on how this can be modelled.
My thought process is below
Option 1
I will create a 'Sales Rep' dimension and a 'Department' dimension and connect them with a bridge table that has dept_id and sales_rep_id.
I would prefer to keep history in both dimensions, so they would be SCD Type 2.
Option 2
I will create a 'Sales Rep' dimension and a 'Department' dimension, and add a "sales rep id" field to the Department dimension, which connects the sales rep with the department. The drawback I have observed here is that the department details will be repeated in the 'Department' table for each employee.
I would prefer to keep history in both dimensions, so they would be SCD Type 2.
Please let me know which of the above options is better, or whether there is a better third approach.
This answer is related to the business model more than to technological needs:
Option 2 makes the most sense if a salesperson can belong to more than one department: keep the department on the "sales" fact table, and then there is no need to keep the department in the "sales person" dimension.
Option 1 makes the most sense if a salesperson belongs to only one department at a time but might change departments: make this a Slowly Changing Dimension Type 2, in which you keep the history.
A slowly changing dimension means you don't need a bridge table: the department is part of the "sales person" table. You can read more about it in the link provided.
In the odd case that a salesperson can work in several departments and have people from various departments reporting to him, the whole hierarchical model should be in a different table. In SSAS a self-referencing table doesn't work well, so look into ways to flatten such hierarchies.
Please note that when you're designing a data warehouse, the star schema means exactly that: data might repeat itself in different tables in order to make reporting easier.
These issues never have a clear-cut solution, and I advise you to read as much as you can on data warehouse design until your head spins in order to get your head around this.
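For the Slowly Changing Dimension Type 2 case described above (one department at a time, with changes over time), a rough sketch might look like this. The column names (`rep_sk`, `valid_from`, `valid_to`), the rep, the departments and the sales are all hypothetical.

```python
from datetime import date

# Sales rep dimension as SCD Type 2: each department change adds a new row
# with a new surrogate key and a validity range.
sales_rep_dim = [
    {"rep_sk": 1, "rep_id": "R1", "department": "Hardware",
     "valid_from": date(2023, 1, 1), "valid_to": date(2023, 6, 30)},
    {"rep_sk": 2, "rep_id": "R1", "department": "Software",
     "valid_from": date(2023, 7, 1), "valid_to": date(9999, 12, 31)},
]

def lookup_rep_sk(rep_id, on_date):
    """Return the surrogate key of the dimension row valid on on_date."""
    for row in sales_rep_dim:
        if row["rep_id"] == rep_id and row["valid_from"] <= on_date <= row["valid_to"]:
            return row["rep_sk"]
    raise KeyError(f"no dimension row for {rep_id} on {on_date}")

# Source sales: (rep_id, sale_date, amount). The load resolves the surrogate key.
source_sales = [("R1", date(2023, 3, 15), 100.0), ("R1", date(2023, 9, 1), 250.0)]
sales_fact = [{"rep_sk": lookup_rep_sk(rep, day), "amount": amount}
              for rep, day, amount in source_sales]

# Roll the same facts up by rep and by the department that applied at the time.
dim_by_sk = {row["rep_sk"]: row for row in sales_rep_dim}
sales_by_rep, sales_by_department = {}, {}
for fact in sales_fact:
    dim_row = dim_by_sk[fact["rep_sk"]]
    sales_by_rep[dim_row["rep_id"]] = sales_by_rep.get(dim_row["rep_id"], 0.0) + fact["amount"]
    dept = dim_row["department"]
    sales_by_department[dept] = sales_by_department.get(dept, 0.0) + fact["amount"]
```

No bridge table is involved: each fact row points at exactly one version of the rep, and that version carries the department.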
Two part architecture question:
I have employee, job title, and supervisor dimensions. I kind of wanted to keep them in one dimension and have something like site > supervisor > job title > employee. The problem is that these need to be SCD. That is, they have historical associations to relate to the facts. The fact tables have a requirement to be processed every five minutes (dashboard).
1) Should I have these in a single dimension with a surrogate key (or composite for that matter)? The keys/surrogate key would be composed of calendar_id - employee_id.
2) Or should the fact tables maintain references to three different dimensions instead?
The requirement to process every 5 minutes (MOLAP, SSIS ETL-driven processing) makes me lean toward keeping the time/change in the facts, so that I would not have to process the dimensions along with the fact tables.
I would design it as a single dimension, with the hierarchy you mentioned: site > supervisor > job title > employee.
Let's call this dimension EmployeeAssignment, because its granularity is not Employees, but any combination of site/supervisor/job title that an employee "adopts" during his/her career. (Feel free to come up with a better name).
I don't think you need a calendar_id key in this dimension: a surrogate key based on DISTINCT SiteID,SupervisorID,JobTitleID,EmployeeID would be enough. Adding a calendar_id key would be making the dimension do too much work: over and above slicing the actual facts, this would make the dimension answer questions like
"Where was employeeID 12345 (in the site/supervisor/job title network) on 1 January 2015?" and
"How many employees did supervisorID 98765 supervise on 1st January 2015?"
These questions IMHO are best addressed with a fact, not a dimension. One cube I've worked on addresses this with an EmployeeDay measure: sliced by the dimensions "EmployeeAssignment" and Time, it simply has a 1 if the employee is in that "assignment" on that day.
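Here is a small sketch of that idea, with invented history rows: the dimension gets one surrogate key per distinct site/supervisor/job title/employee combination, and the EmployeeDay fact holds a 1 per assignment per day. The helper name `supervised_on` and the site and job title codes are made up.

```python
from datetime import date, timedelta

# Hypothetical history: (employee, site, supervisor, job title, first day, last day).
history = [
    (12345, "S1", 98765, "J1", date(2015, 1, 1), date(2015, 1, 3)),
    (12345, "S1", 98765, "J2", date(2015, 1, 4), date(2015, 1, 5)),
    (22222, "S1", 98765, "J1", date(2015, 1, 1), date(2015, 1, 5)),
]

# EmployeeAssignment dimension: one surrogate key per DISTINCT combination.
assignment_sk = {}
for emp, site, sup, job, _, _ in history:
    assignment_sk.setdefault((site, sup, job, emp), len(assignment_sk) + 1)
combo_by_sk = {sk: combo for combo, sk in assignment_sk.items()}

# EmployeeDay fact: one row with the value 1 per assignment per day.
employee_day = []
for emp, site, sup, job, start, end in history:
    day = start
    while day <= end:
        employee_day.append({"assignment_sk": assignment_sk[(site, sup, job, emp)],
                             "day": day, "employee_day": 1})
        day += timedelta(days=1)

def supervised_on(supervisor_id, day):
    """How many employees did this supervisor have on the given day?"""
    return sum(row["employee_day"] for row in employee_day
               if row["day"] == day
               and combo_by_sk[row["assignment_sk"]][1] == supervisor_id)
```

With this sample data, `supervised_on(98765, date(2015, 1, 1))` counts two employees, and the dimension itself never needs a date in its key.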
This EmployeeAssignment SCD is actually pretty slowly-changing, especially compared to your 5-minute fact update interval. Employees are not going to move about or get promoted every 5 minutes, so a reprocess of the dimension shouldn't be necessary more often than daily.
If I've misunderstood anything, let me know in the comments.
I am fairly new to OLAP and SSAS but seasoned in relational data warehouses, and my question about Reference Dimensions is: are they bad, a necessary evil, or useful when used correctly? Every single post I can find references Adventure Works and the Geography dimension, but I am looking for real-world experience.
My cube has a date dimension that is pretty standard, and I want to create a date metrics REFERENCE dimension that has an FK DateId to my Date dimension. Inside this Date Metrics reference dimension I will add a member for AccountId and several "Action" members to summarize specific actions that I want to count by date, month, year, etc.
At the root of it, my date metrics reference dimension will be unique on DateId AND AccountId, which will enable me to summarize "action" movement by the date dimension I am trying to relate it back to.
Do I have this all wrong?
Reference dimensions: this is about how you reference a dimension from a cube. When dimensions are created, they exist on their own; you add them to a cube on the "Dimension Usage" tab. This is necessary to be able to browse the cube's data using the dimension.
I think you are actually asking about "attribute relationships" (the second tab on the dimension configuration), and the answer is that they are extremely useful. I even saw a video once in which a Microsoft MVP said that it's probably the most important configuration you can do on your cube.
Attribute relationships indicate how the attributes of a dimension relate to one another. So, for example, on a date dimension you will have
day -> month -> quarter -> year
It's always the "opposite" order to the one you would use in a hierarchy.
Another very important setting is the relationship type, which in the date example you should set to rigid, because the data will never change (the 01-01-2012 member will always belong to the year 2012), so SSAS will keep the calculated aggregations when you process the cube (unless, of course, you do a full process).
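One way to sanity-check a chain like day -> month -> quarter -> year before configuring it is to confirm that every child value maps to exactly one parent value in the dimension data. The sketch below does this in plain Python over a generated two-year date table; it is not an SSAS API, and the attribute names and key formats are assumptions.

```python
from datetime import date, timedelta

def date_row(d):
    """Build one date dimension row with keys that include the year."""
    quarter = (d.month - 1) // 3 + 1
    return {
        "day": d.isoformat(),
        "month": f"{d.year}-{d.month:02d}",
        "quarter": f"{d.year}-Q{quarter}",
        "year": d.year,
        "month_no": d.month,  # month number alone, without the year
    }

# All days of 2012 and 2013 (366 + 365 = 731 days).
date_dim = [date_row(date(2012, 1, 1) + timedelta(days=i)) for i in range(731)]

def is_many_to_one(rows, child, parent):
    """True if every child value is always paired with the same parent value."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[child], row[parent]) != row[parent]:
            return False
    return True

chain_ok = all(is_many_to_one(date_dim, child, parent)
               for child, parent in [("day", "month"), ("month", "quarter"),
                                     ("quarter", "year")])
month_no_ok = is_many_to_one(date_dim, "month_no", "year")
```

With these keys `chain_ok` is True, while `month_no_ok` is False, because a bare month number such as 1 occurs in both years; that is why the month attribute's key usually needs to include the year.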