Handling many-to-many and one-to-many relationships between dimensions

I have a scenario where one sales rep is related to more than one department, and I need to calculate sales at both the sales rep level and the department level. Please share your thoughts on how this can be modelled.
My thinking so far is below.
Option 1
Create a 'Sales Rep' dimension and a 'Department' dimension and connect them with a bridge table that holds dept_id and sales_rep_id.
I want to keep history in both dimensions, so both are SCD Type 2.
Option 2
Create a 'Sales Rep' dimension and a 'Department' dimension, and add a "sales rep id" field to the Department dimension to connect each sales rep to a department. The drawback I see is that the department details are repeated in the 'Department' table for each employee.
I want to keep history in both dimensions, so both are SCD Type 2.
Which of these options is better, or is there a third approach that is better still?

This answer has more to do with the business model than with technological needs:
Option 2 makes the most sense if a sales person can belong to more than one department: keep the department on the "sales" fact table, and then there is no need to keep the department in the "sales person" dimension.
Option 1 makes the most sense if a sales person belongs to only one department at a time but might change departments. In that case, make the dimension a Slowly Changing Dimension Type 2, in which you keep the history.
With a slowly changing dimension you don't need a bridge table: the department is part of the "sales person" table. You can read more about it in the link provided.
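As a rough illustration of the Type 2 approach described above, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (dim_sales_person, fact_sales, valid_from, valid_to) and the sample rows are assumptions made for this example, not part of the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical Type 2 dimension: one row per sales person per department stint.
CREATE TABLE dim_sales_person (
    sales_person_sk INTEGER PRIMARY KEY,  -- surrogate key, new value per version
    sales_person_id TEXT,                 -- business key from the source system
    name            TEXT,
    department      TEXT,
    valid_from      TEXT,
    valid_to        TEXT                  -- '9999-12-31' marks the current row
);
CREATE TABLE fact_sales (
    sales_person_sk INTEGER REFERENCES dim_sales_person(sales_person_sk),
    sale_date       TEXT,
    amount          REAL
);
-- Ann moves from Hardware to Software on 2020-07-01, so she gets a second row.
INSERT INTO dim_sales_person VALUES
    (1, 'SP01', 'Ann', 'Hardware', '2020-01-01', '2020-06-30'),
    (2, 'SP01', 'Ann', 'Software', '2020-07-01', '9999-12-31');
-- Each sale points at the dimension row that was valid on the sale date.
INSERT INTO fact_sales VALUES
    (1, '2020-03-15', 100.0),
    (1, '2020-05-02', 50.0),
    (2, '2020-08-20', 200.0);
""")

# Sales by department: history is preserved because old sales keep the old key.
by_department = dict(conn.execute("""
    SELECT d.department, SUM(f.amount)
    FROM fact_sales f JOIN dim_sales_person d USING (sales_person_sk)
    GROUP BY d.department
"""))

# Sales by person: group on the business key to combine all versions.
by_person = dict(conn.execute("""
    SELECT d.sales_person_id, SUM(f.amount)
    FROM fact_sales f JOIN dim_sales_person d USING (sales_person_sk)
    GROUP BY d.sales_person_id
"""))

print(by_department)
print(by_person)
```

No bridge table is involved: the department is just an attribute of the dimension row that each sale references.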
In the odd case that a sales person can work in several departments and have people from various departments reporting to him, the whole hierarchical model should live in a different table. A self-referencing table doesn't work well in SSAS, so look into ways to flatten the hierarchy.
Please note that when you're designing a data warehouse, the star schema means exactly that: data may repeat across tables in order to make reporting easier.
These issues never have a clear-cut solution, and I advise you to read as much as you can on data warehouse design, until your head spins, in order to get your head around this.

Related

NHS Data Warehouse

I have coursework that I do not understand. I tried emailing my tutor but he did not respond, and I have been waiting for about 2 months now. I am supposed to create a Star/Snowflake schema focusing on 2 fact tables.
The project must focus on the NHS. We are free to define the scope, so I decided to focus on COVID-19. I have created a star schema for 1 fact table, called "Deaths". My idea is for the data warehouse to show which areas have the highest death rate, so that the NHS knows which areas are in most need and can manage the situation accordingly.
I was thinking of making the second fact table Infection/Infected, to show which areas have the highest infection rates. I think it would not work because the dimensions for "Infected" would have to be different from the ones for deaths (I am not sure whether they have to be the same).
Could you share your thoughts and recommendations?
Here is the assignment brief, and below the brief is my star schema design (which I think is wrong).
I don't see the need for two facts, one for recovered and one for death cases.
You can have a single FactDiagnosticAnalysis containing:
TreatmenCenterSK
PatientSK
TreatmenSK
StaffSK
DiagnosticSK
DateSK
Result
InsertedDate: a technical column to capture when the record was inserted
The Result column will hold one of the values Infected, Not Infected, Recovered, or Dead at a specific date, since:
a patient will have many analyses until recovery
a patient can turn out to be not infected after the analysis done on arrival
a patient will recover after many analyses
a patient can die after many analyses
Your model can look like the one below:
Actually, in this case your fact is a factless fact.
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts.
You create the measures in your reports/dashboards as views (if you are using SQL):
Areas having the highest death rate
The number of medical centers reaching their maximum capacity
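A minimal sketch of the "measures as views" idea, using Python's built-in sqlite3 module. The snake_case table and column names are simplified from the list above, and the area attribute on the treatment center dimension, the view name, and the sample rows are all assumptions made for illustration. "Death rate" is taken here to mean deaths divided by infected patients, which is only one possible definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_treatment_center (
    treatment_center_sk INTEGER PRIMARY KEY,
    area                TEXT              -- hypothetical reporting attribute
);
-- The fact holds only dimension keys and a status, no numeric measures.
CREATE TABLE fact_diagnostic_analysis (
    treatment_center_sk INTEGER,
    patient_sk          INTEGER,
    date_sk             INTEGER,
    result              TEXT
);
INSERT INTO dim_treatment_center VALUES (1, 'North'), (2, 'South');
INSERT INTO fact_diagnostic_analysis VALUES
    (1, 101, 20200401, 'Infected'),
    (1, 101, 20200410, 'Dead'),
    (1, 102, 20200402, 'Infected'),
    (1, 102, 20200415, 'Recovered'),
    (2, 201, 20200403, 'Infected'),
    (2, 201, 20200412, 'Recovered'),
    (2, 202, 20200405, 'Not Infected');

-- The measure lives in a view: distinct patients per status, by area.
CREATE VIEW v_death_rate_by_area AS
SELECT c.area,
       COUNT(DISTINCT CASE WHEN f.result = 'Dead'     THEN f.patient_sk END) AS deaths,
       COUNT(DISTINCT CASE WHEN f.result = 'Infected' THEN f.patient_sk END) AS infected
FROM fact_diagnostic_analysis f
JOIN dim_treatment_center c USING (treatment_center_sk)
GROUP BY c.area;
""")

# Rank areas by death rate; the sample data has no area with zero infected.
rows = conn.execute("""
    SELECT area, deaths, infected, 1.0 * deaths / infected AS death_rate
    FROM v_death_rate_by_area
    ORDER BY death_rate DESC
""").fetchall()
print(rows)
```

The counting happens entirely in the view, so the fact table itself never needs a numeric column.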

Put hierarchy in its own separate dimension or set it as a part of related dimension?

I'm new to dimensional modeling in data warehouse design, and I'm facing some confusion in my first design.
I took a simple business process (Vacation Request), and I want to ask which of these two designs is more accurate, effective, and applicable. I would be grateful for a detailed answer. (My question is mainly about the dimension design.)
1-
Dimension.Employee      Fact.Vacation
[Employee Key]          [Employee Key] FK PK
[_source key]           [Vacation Transaction] PK DD
[Employee Name]         ...
....                    ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
....
2-
Dimension.Employee      Dimension.Section       Fact.Vacation
[Employee Key]          [Section Key]           [Employee Key] FK PK
[_source key]           [_source key]           [Vacation Transaction] PK DD
[Employee Name]         [Department Code]       [Section Key] FK
....                    [Department Name]       ...
....                    [Campus Code]
                        [Campus Name]
Where the hierarchy is like this:
Campus --> many Departments --> many Sections --> many Employees
Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...
WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).
The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.
So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.
FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.
CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.
DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up constantly modifying the employee record in your employee dimension. This leads to a longer than needed dimension table if it is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements). Point for putting them into separate dimensions.
DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In this case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which your fact table can relate. Point for putting them into separate dimensions.
WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.
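To make the conformed-dimension case concrete, here is a small sketch using Python's built-in sqlite3 module. The equipment purchase fact comes from the hypothetical scenario above; the column names, the days_requested measure, and the sample rows are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Section is its own dimension so that both facts can reference it.
CREATE TABLE dim_section (
    section_key     INTEGER PRIMARY KEY,
    section_name    TEXT,
    department_name TEXT,
    campus_name     TEXT
);
CREATE TABLE dim_employee (
    employee_key  INTEGER PRIMARY KEY,
    employee_name TEXT
);
-- Grain: one row per vacation request (employee level).
CREATE TABLE fact_vacation (
    employee_key   INTEGER REFERENCES dim_employee(employee_key),
    section_key    INTEGER REFERENCES dim_section(section_key),
    days_requested INTEGER
);
-- Grain: one row per purchase (section level, no employee).
CREATE TABLE fact_equipment_purchase (
    section_key INTEGER REFERENCES dim_section(section_key),
    cost        REAL
);
INSERT INTO dim_section VALUES
    (1, 'Admissions', 'Registrar', 'Main'),
    (2, 'Payroll',    'Finance',   'Main');
INSERT INTO dim_employee VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO fact_vacation VALUES (1, 1, 5), (2, 1, 3), (3, 2, 10);
INSERT INTO fact_equipment_purchase VALUES (1, 1200.0), (2, 300.0), (2, 200.0);
""")

# Aggregate each fact separately, then line the results up on the shared
# dimension. Joining the two facts directly would multiply rows.
report = conn.execute("""
    SELECT s.section_name,
           (SELECT SUM(v.days_requested) FROM fact_vacation v
             WHERE v.section_key = s.section_key) AS vacation_days,
           (SELECT SUM(e.cost) FROM fact_equipment_purchase e
             WHERE e.section_key = s.section_key) AS equipment_cost
    FROM dim_section s
    ORDER BY s.section_key
""").fetchall()
print(report)
```

With section folded into the employee dimension instead, the equipment fact would have no single row to point at for a section.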
Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!
The second.
Employees are a WHO, departments are a WHERE.

Filtering Tabular Model Dimension Across Multiple Facts

I am building a Tabular model in an educational environment, and am running into some difficulty around filtering the lookup table across multiple base tables.
I've read about bridge tables, which I believe I could use, but I have over a dozen fact tables, so the amount of corresponding bridge tables seems cumbersome to manage. Could I handle the filtering via DAX instead? I've been working to learn more DAX, but am new and stumped so far.
Here is a sample scenario of what I'm trying to accomplish: to simplify things, I have a dimension called Student which contains a row for each student. I have a fact called Discipline which contains discipline incident records for each student (1 per incident, and not all students will have a discipline incident). Also, I have a fact table called Assessment which contains assessment records for a test (1 per assessment completed by a student). Some students won't take this assessment, so they will have no corresponding scores.
When I model this data, in a Pivot Table, for example, to analyze a correlation between discipline and assessments, I bring in a measure called Discipline Incident Count (count of discipline incident records) and Assessment Average Score (average of assessment scores). I want to view only the students that have values for both, but I get something like the following:
Student Name    Discipline Incident Count    Assessment Average Score
Student A       (blank)                      85.7
Student B       3                            (blank)
Student C       2                            88.7
In this case, I would want my result set to include only Student C, since they have a value for both. I have also tested handling the filtering in the application layer (Excel) by filtering out blanks in each column, but with the real data, which might have nested values and a large volume, this doesn't seem to work well.
Thanks for any feedback!
So, filtering this in the application layer is probably your best bet.
Otherwise, you'll have to write your measures such that each one checks the values of all other measures to determine whether to display.
An example for the sample you've shared is as follows:
ConditionalDiscipline:=
SWITCH(
    TRUE()
    ,ISBLANK( [Assessment Average Score] )
        ,BLANK()
    ,[Discipline Incident Count]
)
SWITCH() is just syntactic sugar for nested IF()s.
Now, this measure would only work when your pivot table consists of [Discipline Incident Count] and [Assessment Average Score]. If you add another measure from a new fact table, the presence of a value for the student on the pivot table row will have no effect on the display of that row.
You'd need a new measure:
ConditionalDiscipline - version 2:=
SWITCH(
    TRUE()
    ,ISBLANK( [Assessment Average Score] )
        ,BLANK()
    ,ISBLANK( [Your new measure] )
        ,BLANK()
    ,[Discipline Incident Count]
)
Now this version 2 will only work if both [Assessment Average Score] and [Your new measure] are non-blank. If you use this version 2 in the sample you've posted without [Your new measure], then it will still return blank for the students that have no entry for [your new measure].
Thus, you'll need one version of this for each possible combination of other measures that would be used with it.
There is no way to define "the set of all other measures currently being evaluated in the same context" in DAX, which is what you really need to test against for the existence of blanks.
You could, rather than using a SWITCH(), just multiply all the other measures together in a single test in IF(), because multiplication by blank yields blank. This does nothing for the combinatorial number of measures you'll need to write. You've got a lot of options for exactly how you define your test, but they all would end up needing the same absurd number of separate measure definitions.
I'd also like to address another point in your question. You mentioned using a bridge table. A bridge is only necessary when you implement a many-to-many relationship between dimension tables. You are not implementing this. You have several facts with a single conformed dimension. The conformed dimension is essentially a bridge between your facts, each fact existing in a many-to-many relationship with the other facts, based on the StudentKey. Any bridge table is superfluous.
The long story, shortened, is no. DAX has no introspective facilities sufficient to do what you want. Handle it at the application layer.
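For what "handle it at the application layer" might look like outside Excel, here is a minimal Python sketch. It assumes the two fact tables have been extracted as plain rows; the variable names and sample rows are invented for illustration and mirror the table in the question.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical extracts: one entry per fact row.
students = ["Student A", "Student B", "Student C"]
discipline_rows = ["Student B", "Student B", "Student B", "Student C", "Student C"]
assessment_rows = [("Student A", 85.7), ("Student C", 88.7)]

# Discipline Incident Count: number of incident rows per student.
incident_count = Counter(discipline_rows)

# Assessment Average Score: mean of the scores per student.
scores = defaultdict(list)
for student, score in assessment_rows:
    scores[student].append(score)

# Keep only students that have a value for every measure on the report.
report = [
    (s, incident_count[s], mean(scores[s]))
    for s in students
    if s in incident_count and s in scores
]
print(report)
```

Adding a third measure means adding one more membership test to the filter, rather than a new variant of every existing measure.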

Star schema for target and actual comparison Kimball

I am going to model one of the star schemas for a university data warehousing project. We need to compare the actual application count with a target.
There are target counts (set by the colleges every year) associated with Departments, Course groups, and Courses.
The requirement is to ensure that the targets are allocated correctly and to track the progress of applications against the targets.
One proposal is to include all the actual counts (department level total accepted count, course group level total accepted count, course level total accepted count) and the corresponding target counts (dept level target counts, course group target counts, course target counts) in a single fact table. One of the dimensions in this star schema is the Course dimension. It contains all the course, course group and department information. I do understand there is a granularity problem here, but this could be handled at the cube implementation level.
Or, if I want to set targets at different hierarchy levels of a dimension, should I build separate fact tables, as described below?
Implement 3 fact tables for the 3 types of targets and connect these fact tables to the actual fact table. In this situation, the Course dimension can snowflake into course group and department dimensions. The first fact is connected to course, the second to course group, and the third to department. The actual fact table has course-level granularity, so all three target fact tables can be connected to it via course. Because the actual fact table is at course grain, it can be aggregated to get higher-level actual counts such as course group and department.
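A rough sketch of the second proposal, using Python's built-in sqlite3 module. To keep it short, the snowflaked course group and department dimensions are collapsed into attributes of a single dim_course table and the target tables are keyed on those attribute values; that simplification, all table and column names, and the sample figures are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_course (
    course_key   INTEGER PRIMARY KEY,
    course       TEXT,
    course_group TEXT,
    department   TEXT
);
-- Actuals are stored once, at course grain.
CREATE TABLE fact_actual (course_key INTEGER, accepted_count INTEGER);
-- One target table per level at which the colleges set targets.
CREATE TABLE fact_target_course       (course       TEXT, target_count INTEGER);
CREATE TABLE fact_target_course_group (course_group TEXT, target_count INTEGER);
CREATE TABLE fact_target_department   (department   TEXT, target_count INTEGER);

INSERT INTO dim_course VALUES
    (1, 'Algebra',  'Maths UG',      'Science'),
    (2, 'Calculus', 'Maths UG',      'Science'),
    (3, 'Ethics',   'Philosophy UG', 'Arts');
INSERT INTO fact_actual VALUES (1, 40), (2, 35), (3, 20);
INSERT INTO fact_target_course VALUES ('Algebra', 50), ('Calculus', 30), ('Ethics', 15);
INSERT INTO fact_target_course_group VALUES ('Maths UG', 80), ('Philosophy UG', 15);
INSERT INTO fact_target_department VALUES ('Science', 100), ('Arts', 30);
""")

LEVELS = ("course", "course_group", "department")

def actual_vs_target(level):
    """Return (member, target, actual) rows for one hierarchy level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    # Roll the course-grain actuals up to the level at which the target is set.
    sql = f"""
        SELECT t.{level}, t.target_count, SUM(a.accepted_count)
        FROM fact_target_{level} t
        JOIN dim_course c ON c.{level} = t.{level}
        JOIN fact_actual a ON a.course_key = c.course_key
        GROUP BY t.{level}, t.target_count
        ORDER BY t.{level}
    """
    return conn.execute(sql).fetchall()

for level in LEVELS:
    print(level, actual_vs_target(level))
```

Each target is stored at the grain at which it was set, so no target value is repeated across course rows and there is nothing to double count when rolling up.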
Data Architects please comment!

SSAS Cube Modeling

Two part architecture question:
I have employee, job title, and supervisor dimensions. I kind of wanted to keep them in one dimension and have something like site > supervisor > job title > employee. The problem is that these need to be SCD. That is, they have historical associations to relate to the facts. The fact tables have a requirement to be processed every five minutes (dashboard).
1) Should I have these in a single dimension with a surrogate key (or composite for that matter)? The keys/surrogate key would be composed of calendar_id - employee_id.
2) Or should the fact tables maintain references to three different dimensions instead?
The requirement to process every 5 minutes (MOLAP processing driven by SSIS ETL) makes me lean toward keeping the time/change in the facts, so that I would not have to process the dimensions along with the fact tables.
I would design it as a single dimension, with the hierarchy you mentioned: site > supervisor > job title > employee.
Let's call this dimension EmployeeAssignment, because its granularity is not Employees, but any combination of site/supervisor/job title that an employee "adopts" during his/her career. (Feel free to come up with a better name).
I don't think you need a calendar_id key in this dimension: a surrogate key based on DISTINCT SiteID,SupervisorID,JobTitleID,EmployeeID would be enough. Adding a calendar_id key would be making the dimension do too much work: over and above slicing the actual facts, this would make the dimension answer questions like
"Where was employeeID 12345 (in the site/supervisor/job title network) on 1 January 2015?" and
"How many employees did supervisorID 98765 supervise on 1st January 2015?"
These questions IMHO are best addressed with a fact, not a dimension. One cube I've worked on addresses this with an EmployeeDay measure: sliced by the dimensions "EmployeeAssignment" and Time, this simply has a 1 if the employee is in that "assignment" on that day.
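Here is a minimal relational sketch of that idea, using Python's built-in sqlite3 module rather than an SSAS cube. The table names, column names, and sample rows are assumptions made for illustration; only the EmployeeAssignment grain and the EmployeeDay measure come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One row per distinct site/supervisor/job title/employee combination.
CREATE TABLE dim_employee_assignment (
    assignment_sk INTEGER PRIMARY KEY,
    site_id       INTEGER,
    supervisor_id INTEGER,
    job_title_id  INTEGER,
    employee_id   INTEGER,
    UNIQUE (site_id, supervisor_id, job_title_id, employee_id)
);
-- One row per assignment per day; employee_day is always 1.
CREATE TABLE fact_employee_day (
    assignment_sk INTEGER REFERENCES dim_employee_assignment(assignment_sk),
    date_key      TEXT,
    employee_day  INTEGER DEFAULT 1
);
-- Employee 12345 later moves to another site, supervisor, and job title.
INSERT INTO dim_employee_assignment VALUES
    (1, 10, 98765, 7, 12345),
    (2, 10, 98765, 7, 22222),
    (3, 20, 55555, 8, 12345);
INSERT INTO fact_employee_day VALUES
    (1, '2015-01-01', 1),
    (2, '2015-01-01', 1),
    (3, '2015-02-01', 1),
    (2, '2015-02-01', 1);
""")

def headcount(supervisor_id, date_key):
    """How many employees did this supervisor supervise on this day?"""
    row = conn.execute("""
        SELECT COALESCE(SUM(f.employee_day), 0)
        FROM fact_employee_day f
        JOIN dim_employee_assignment a USING (assignment_sk)
        WHERE a.supervisor_id = ? AND f.date_key = ?
    """, (supervisor_id, date_key)).fetchone()
    return row[0]

def where_was(employee_id, date_key):
    """Return (site_id, supervisor_id, job_title_id) for an employee on a day."""
    return conn.execute("""
        SELECT a.site_id, a.supervisor_id, a.job_title_id
        FROM fact_employee_day f
        JOIN dim_employee_assignment a USING (assignment_sk)
        WHERE a.employee_id = ? AND f.date_key = ?
    """, (employee_id, date_key)).fetchone()

print(headcount(98765, "2015-01-01"))
print(where_was(12345, "2015-02-01"))
```

The dimension carries no dates at all; the point-in-time questions are answered by the fact rows.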
This EmployeeAssignment SCD is actually pretty slowly-changing, especially compared to your 5-minute fact update interval. Employees are not going to move about or get promoted every 5 minutes, so a reprocess of the dimension shouldn't be necessary more often than daily.
If I've misunderstood anything, let me know in the comments.