Put hierarchy in its own separate dimension or set it as a part of related dimension?


I'm new to dimensional modeling in data warehouse design, and I've run into some confusion in my first design.
I took a simple business process (Vacation Request), and I want to ask which of these two designs is more accurate, effective, and applicable. I would be grateful for a detailed answer. (My question is mainly about the dimension design.)
1-
Dimension.Employee    Fact.Vacation
[Employee Key]        [Employee Key] FK PK
[_source key]         [Vacation Transaction] PK DD
[Employee Name]       ...
....                  ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
....
2-
Dimension.Employee    Dimension.Section     Fact.Vacation
[Employee Key]        [Section Key]         [Employee Key] FK PK
[_source key]         [_source key]         [Vacation Transaction] PK DD
[Employee Name]       [Department Code]     [Section Key] FK
....                  [Department Name]     ...
....                  [Campus Code]
                      [Campus Name]
Where the hierarchy is like this:
Campus Contains -->
Many Departments and each department contains -->
many sections and each section contains many employees

Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...
WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).
The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.
So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.
FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.
CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.
DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up constantly modifying the employee record in your employee dimension. If this is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements), every move adds a new row, which makes the dimension table longer than it needs to be. Point for putting them into separate dimensions.
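To make the type two growth concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the valid_from / valid_to columns and the move_employee helper are assumptions made up for illustration; they are not taken from the question's design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key  INTEGER PRIMARY KEY,   -- surrogate key
    source_key    TEXT NOT NULL,         -- business key from the source system
    employee_name TEXT NOT NULL,
    section_name  TEXT NOT NULL,         -- hierarchy attribute kept in the employee dimension
    valid_from    TEXT NOT NULL,
    valid_to      TEXT                   -- NULL marks the current version
);
""")

def move_employee(conn, source_key, new_section, change_date):
    """Type 2 change: close the current row and insert a new version."""
    key, name = conn.execute(
        "SELECT employee_key, employee_name FROM dim_employee "
        "WHERE source_key = ? AND valid_to IS NULL",
        (source_key,),
    ).fetchone()
    conn.execute(
        "UPDATE dim_employee SET valid_to = ? WHERE employee_key = ?",
        (change_date, key),
    )
    conn.execute(
        "INSERT INTO dim_employee (source_key, employee_name, section_name, valid_from) "
        "VALUES (?, ?, ?, ?)",
        (source_key, name, new_section, change_date),
    )

conn.execute(
    "INSERT INTO dim_employee (source_key, employee_name, section_name, valid_from) "
    "VALUES ('E1', 'Alice', 'Payroll', '2020-01-01')"
)
move_employee(conn, "E1", "Audit", "2021-03-01")
move_employee(conn, "E1", "Treasury", "2022-06-01")

# One employee, two section moves: three rows in the dimension.
version_count = conn.execute(
    "SELECT COUNT(*) FROM dim_employee WHERE source_key = 'E1'"
).fetchone()[0]
current_section = conn.execute(
    "SELECT section_name FROM dim_employee "
    "WHERE source_key = 'E1' AND valid_to IS NULL"
).fetchone()[0]
```

Each section move costs one extra dimension row per employee, which is the growth described above.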
DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In that case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which that aggregate fact table can relate. Point for putting them into separate dimensions.
WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.
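As a rough sketch of that last case, the snippet below shares one section dimension between an employee-grain vacation fact and a section-grain equipment fact. It uses Python's sqlite3 module; all table and column names (dim_section, fact_equipment_purchase, and so on) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE dim_section (
    section_key     INTEGER PRIMARY KEY,
    section_name    TEXT NOT NULL,
    department_name TEXT NOT NULL,
    campus_name     TEXT NOT NULL
);
CREATE TABLE dim_employee (
    employee_key  INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL
);
-- Grain: one row per vacation transaction (employee level).
CREATE TABLE fact_vacation (
    employee_key  INTEGER NOT NULL REFERENCES dim_employee (employee_key),
    section_key   INTEGER NOT NULL REFERENCES dim_section (section_key),
    vacation_days INTEGER NOT NULL
);
-- Grain: one row per purchase (section level, no employee involved).
CREATE TABLE fact_equipment_purchase (
    section_key INTEGER NOT NULL REFERENCES dim_section (section_key),
    amount      REAL NOT NULL
);
INSERT INTO dim_section VALUES (1, 'Payroll', 'Finance', 'Main'),
                               (2, 'Audit',   'Finance', 'Main');
INSERT INTO dim_employee VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO fact_vacation VALUES (1, 1, 5), (2, 1, 3);
INSERT INTO fact_equipment_purchase VALUES (1, 1200.0), (2, 300.0), (2, 200.0);
""")

# Aggregate each fact separately, then line the results up on the shared dimension.
rows = conn.execute("""
SELECT s.section_name,
       (SELECT COALESCE(SUM(v.vacation_days), 0)
          FROM fact_vacation v WHERE v.section_key = s.section_key),
       (SELECT COALESCE(SUM(p.amount), 0)
          FROM fact_equipment_purchase p WHERE p.section_key = s.section_key)
FROM dim_section s
ORDER BY s.section_name
""").fetchall()
```

Aggregating each fact on its own before combining avoids double counting when the two facts have different numbers of rows per section.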
Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!

The second.
Employees are a WHO, departments are a WHERE.

Related

Data Warehouse Architecture Modeling

I'm trying to architect a data warehouse using the star schema model... any ideas would be appreciated.
Any idea what I should do to create a star schema? Some say that I should have a linking table with DimProjects going to the fact tables. What about project hours? What is the right approach here, or do I need other tables to link? Employees can work on multiple projects, projects require man-hours, etc.
What is the best approach to modeling this?
So far I have tables:
[CODE]
Dimension Tables Measure Tables
---------------- --------------
DimEmployee FactCRM
DimProjects FactTargets
DimSalesDetails FactRevenue
DimAccounts
DimTerritories
DimDate
DimTime
[/CODE]

Dimensions in a data warehouse schema are independent entities, for example:
Dim_Employee
Empid (PK)
Name
Address, etc.
and likewise for all the other dimensions.
Each dimension's key is linked to your fact tables. In the case above, FactCRM would include only CRM-related measures and would be linked to its specific dimensions, depending on the requirements.
Without knowing the columns, no one can tell what you actually want. Also remember that linking a dimension to a fact is itself a partial star schema, so that doesn't lead to any issues. The only thing is that if your dimensions are themselves normalized, the schema becomes a snowflake.
Another point about facts: if you want to derive other facts from existing facts, then you have to link the fact tables as well, for example through a unique fact ID. A schema with multiple related fact tables like this is called a fact constellation. The schema would then become a star/snowflake schema with a fact constellation.
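For the project-hours part of the question, one possible shape is a fact table keyed by employee, project and date, so that an employee working on several projects is simply several fact rows. This is only a sketch in Python's sqlite3 module: fact_project_hours, its columns and the sample data are invented here, since the real columns were not posted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_employee (
    employee_key  INTEGER PRIMARY KEY,
    employee_name TEXT NOT NULL
);
CREATE TABLE dim_project (
    project_key  INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
-- Hypothetical fact: one row per employee, project and day worked.
CREATE TABLE fact_project_hours (
    employee_key INTEGER NOT NULL REFERENCES dim_employee (employee_key),
    project_key  INTEGER NOT NULL REFERENCES dim_project (project_key),
    date_key     INTEGER NOT NULL,   -- e.g. 20240101, would point at DimDate
    hours        REAL NOT NULL
);
INSERT INTO dim_employee VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dim_project  VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO fact_project_hours VALUES
    (1, 1, 20240101, 4.0),
    (1, 2, 20240101, 3.5),
    (2, 1, 20240102, 8.0);
""")

# Man-hours per project.
hours_by_project = conn.execute("""
SELECT p.project_name, SUM(f.hours)
FROM fact_project_hours f
JOIN dim_project p ON p.project_key = f.project_key
GROUP BY p.project_name
ORDER BY p.project_name
""").fetchall()

# Projects a single employee has worked on (many-to-many resolved by the fact).
alice_projects = [r[0] for r in conn.execute("""
SELECT DISTINCT p.project_name
FROM fact_project_hours f
JOIN dim_project p  ON p.project_key = f.project_key
JOIN dim_employee e ON e.employee_key = f.employee_key
WHERE e.employee_name = 'Alice'
ORDER BY p.project_name
""")]
```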

Filtering Tabular Model Dimension Across Multiple Facts

I am building a Tabular model in an educational environment, and am running into some difficulty around filtering the lookup table across multiple base tables.
I've read about bridge tables, which I believe I could use, but I have over a dozen fact tables, so the amount of corresponding bridge tables seems cumbersome to manage. Could I handle the filtering via DAX instead? I've been working to learn more DAX, but am new and stumped so far.
Here is a sample scenario of what I'm trying to accomplish: to simplify things, I have a dimension called Student which contains a row for each student. I have a fact called Discipline which contains discipline incident records for each student (1 per incident, and not all students will have a discipline incident). Also, I have a fact table called Assessment which contains assessment records for a test (1 per assessment completed by a student). Some students won't take this assessment, so they will have no corresponding scores.
When I model this data, in a Pivot Table, for example, to analyze a correlation between discipline and assessments, I bring in a measure called Discipline Incident Count (count of discipline incident records) and Assessment Average Score (average of assessment scores). I'm wanting to only view a list of students that have values for both, but I get something like the following:
Student Name    Discipline Incident Count    Assessment Average Score
Student A       (blank)                      85.7
Student B       3                            (blank)
Student C       2                            88.7
In this case, I would want my result set to only include Student C, since they have a value for both. I have also tested handling the filtering on the application layer (Excel) by filtering out blanks in each column, but that doesn't seem to work well with the real data, which is large and might have nested values.
Thanks for any feedback!

So, filtering this in the application layer is probably your best bet.
Otherwise, you'll have to write your measures such that each one checks the values of all other measures to determine whether to display.
An example for the sample you've shared is as follows:
ConditionalDiscipline:=
SWITCH(
    TRUE()
    ,ISBLANK( [Assessment Average Score] )
        ,BLANK()
    ,[Discipline Incident Count]
)
SWITCH() is just syntactic sugar for nested IF()s.
Now, this measure would only work when your pivot table consists of [Discipline Incident Count] and [Assessment Average Score]. If you add another measure from a new fact table, the presence of a value for the student on the pivot table row will have no effect on the display of that row.
You'd need a new measure:
ConditionalDiscipline - version 2:=
SWITCH(
    TRUE()
    ,ISBLANK( [Assessment Average Score] )
        ,BLANK()
    ,ISBLANK( [Your new measure] )
        ,BLANK()
    ,[Discipline Incident Count]
)
Now this version 2 will only work if both [Assessment Average Score] and [Your new measure] are non-blank. If you use this version 2 in the sample you've posted without [Your new measure], then it will still return blank for the students that have no entry for [your new measure].
Thus, you'll need one version of this for each possible combination of other measures that would be used with it.
There is no way to define "the set of all other measures currently being evaluated in the same context" in DAX, which is what you really need to test against for the existence of blanks.
You could, rather than using a SWITCH(), just multiply all the other measures together in a single test in IF(), because multiplication by blank yields blank. This does nothing for the combinatorial number of measures you'll need to write. You've got a lot of options for exactly how you define your test, but they all would end up needing the same absurd number of separate measure definitions.
I'd also like to address another point in your question. You mentioned using a bridge table. A bridge is only necessary when you implement a many-to-many relationship between dimension tables. You are not implementing this. You have several facts with a single conformed dimension. The conformed dimension is essentially a bridge between your facts, each fact existing in a many-to-many relationship with the other facts, based on the StudentKey. Any bridge table is superfluous.
The long story, shortened, is no. There are not introspective facilities in DAX sufficient to do what you want. Handle it at the application layer.
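If the application layer can be something scriptable rather than Excel, the filter itself is small. The sketch below is plain Python and assumes each measure has already been pulled out as a mapping from student to value; the dictionary names and the students_with_all helper are hypothetical, and the values mirror the sample in the question.

```python
# Each measure as {student: value}; students with no value are simply absent.
discipline_counts = {"Student B": 3, "Student C": 2}
assessment_avgs = {"Student A": 85.7, "Student C": 88.7}

def students_with_all(*measures):
    """Return the students that have a value in every measure passed in."""
    common = set(measures[0])
    for measure in measures[1:]:
        common &= set(measure)
    return sorted(common)

both = students_with_all(discipline_counts, assessment_avgs)
report = [(s, discipline_counts[s], assessment_avgs[s]) for s in both]
```

Because the helper takes whichever measures are on the report at that moment, adding a third measure does not require a new variant, which is exactly what the DAX measures cannot do.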

Data Warehouse - duplicate dimension members for multiple divisions

I am fairly new to data warehousing and SSIS, but I have been tasked with populating a data warehouse with sales transaction records from 2 different divisions of the parent company. My issue...I am modifying the SSIS package that populates the Product (SKUs) dimension to accommodate for the Products that pertain to the two divisions and I have ended up with a few Product names that exist in both divisions. I need a solution to accommodate the Product list for each division in the SAME dimension table. Is this possible??
To illustrate:
https://www.dropbox.com/s/hkda4n1bfs5o178/Capture.JPG?dl=0
Where 'widget_3' and 'widget_4' are named the same in both divisions, but they are NOT the same product. They just happen to be named the same. I imagine this is a common problem, but I am reluctant to make any changes to the dimension table schema before consulting with someone first.
I am working with a Product dimension table that has [MemberID] as the primary key and [Product] as a unique non clustered constraint with IGNORE_DUP_KEY = OFF. My first instinct was to modify the table schema to change the IGNORE_DUP_KEY to ON and rely on having a [Division] attribute to help populate the data in the fact table; use [Product] and [Division] to identify the [MemberID] on update.
Something like this??:
https://www.dropbox.com/s/fjzvsh80mtp3ozs/Capture2.JPG?dl=0
Am I going down the wrong path?
Notes:
- Using SQL 2008

At the end of the day, this is a business problem. If there is a name conflict between two departments, it should be resolved before the data is presented together; otherwise a department will see sales on its product that do not belong to it.
Once it is agreed how to treat this at the global level (for example, a short department prefix in case of a clash, but this has to be agreed upon), the problem is automatically solved.
If the departments cannot be reached or do not agree on a solution, you could have two product name columns, one for each department, and use them together as the PK (I would not include a division, or at least I would not show it, because it is confusing for end users). But I do recommend finding a business solution, not a technical one.
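If a technical stopgap is needed in the meantime, the idea from the question (identify [MemberID] by [Product] plus [Division]) could look roughly like the sketch below. It uses Python's sqlite3 module rather than SQL Server 2008, the table and helper names are hypothetical, and it replaces the single-column unique constraint on [Product] with a composite one instead of relying on IGNORE_DUP_KEY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    member_id INTEGER PRIMARY KEY,
    division  TEXT NOT NULL,
    product   TEXT NOT NULL,
    UNIQUE (division, product)   -- same name allowed once per division
);
INSERT INTO dim_product (division, product) VALUES
    ('Division A', 'widget_1'),
    ('Division A', 'widget_3'),
    ('Division B', 'widget_3');
""")

def lookup_member_id(conn, division, product):
    """Resolve the surrogate key for a fact row from division plus product name."""
    row = conn.execute(
        "SELECT member_id FROM dim_product WHERE division = ? AND product = ?",
        (division, product),
    ).fetchone()
    return row[0] if row else None

id_a = lookup_member_id(conn, "Division A", "widget_3")
id_b = lookup_member_id(conn, "Division B", "widget_3")
missing = lookup_member_id(conn, "Division B", "widget_1")
```

The two widget_3 rows get different member IDs, so fact rows from each division land on their own product.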

SSIS Population of Slowly Changing Dimension with outrigger

Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So we have an open number of areas of work and an open number of teams, we can't have team 1, team 2 etc in our dimensions. The other attributes may change over time also, like base location (where they work out of), the main team and area they work in..
So, following Kimball, I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthProfessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work in which they work changes, we need to create a new entry in the SCD and its outrigger tables to encapsulate this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?

Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place TeamKey and the WorkAreaKey in the dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
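A minimal sketch of the "use a fact" option for the many-to-many case, written with Python's sqlite3 module. The table names (dim_team, fact_team_membership) and the sample rows are made up for illustration, and history columns are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_health_professional (
    hp_key INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);
CREATE TABLE dim_team (
    team_key  INTEGER PRIMARY KEY,
    team_name TEXT NOT NULL,
    team_type TEXT NOT NULL
);
-- Factless fact: one row per (professional, team) membership.
CREATE TABLE fact_team_membership (
    hp_key   INTEGER NOT NULL REFERENCES dim_health_professional (hp_key),
    team_key INTEGER NOT NULL REFERENCES dim_team (team_key),
    PRIMARY KEY (hp_key, team_key)
);
INSERT INTO dim_health_professional VALUES (1, 'Nurse N'), (2, 'Doctor D');
INSERT INTO dim_team VALUES (1, 'Team X', 'Community'),
                            (2, 'Team Y', 'Ward'),
                            (3, 'Team Z', 'Ward');
INSERT INTO fact_team_membership VALUES (1, 1), (1, 2), (2, 3);
""")

# An open number of teams per person needs no "team 1, team 2" columns.
nurse_teams = [r[0] for r in conn.execute("""
SELECT t.team_name
FROM fact_team_membership m
JOIN dim_team t ON t.team_key = m.team_key
WHERE m.hp_key = 1
ORDER BY t.team_name
""")]
```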

I'm not sure I understand your question fully. If you are unsure about change detection, then use checksums in the package. Build up a temp table with the data as it is in the source, then compare each row to its counterpart in the dimension (joined via the business keys) by computing the checksum for both rows and comparing those. If they differ, the data has changed.
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context), then the foreign key lookups will automatically look up the newer entry in DimHealthProfessionals if you have historization (i.e. validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.
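The checksum comparison can be sketched outside SSIS in a few lines of Python. This is only an illustration of the idea, assuming rows are available as dictionaries: the helper names, the hp_id business key and the sample rows are hypothetical, and SHA-256 stands in for whatever checksum function the package actually uses.

```python
import hashlib

def row_checksum(row, columns):
    """Hash the tracked attributes of one row in a fixed column order."""
    # A separator keeps ("ab", "c") and ("a", "bc") from hashing the same.
    # Note: None and the empty string hash identically in this sketch.
    payload = "\x1f".join("" if row[c] is None else str(row[c]) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def changed_keys(source_rows, dim_rows, key, columns):
    """Business keys whose tracked attributes differ between source and dimension."""
    dim_by_key = {r[key]: r for r in dim_rows}
    changed = []
    for src in source_rows:
        current = dim_by_key.get(src[key])
        if current is not None and row_checksum(src, columns) != row_checksum(current, columns):
            changed.append(src[key])
    return changed

tracked = ["name", "main_team", "base_location"]
source = [
    {"hp_id": "HP1", "name": "Nurse N", "main_team": "Team X", "base_location": "North"},
    {"hp_id": "HP2", "name": "Doctor D", "main_team": "Team Z", "base_location": "South"},
]
dimension = [
    {"hp_id": "HP1", "name": "Nurse N", "main_team": "Team X", "base_location": "North"},
    {"hp_id": "HP2", "name": "Doctor D", "main_team": "Team Z", "base_location": "East"},
]

changed = changed_keys(source, dimension, "hp_id", tracked)
```

Rows that exist only in the source are new members rather than changes, so this sketch leaves them out of the result.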

Two or more similar counts on fact table in dimensional modelling

I have designed a fact table that stores the facts for a specific date dimension and an action type such as create, update or cancelled. The facts can be created and cancelled only once, but updated many times.
myfact
---------------
date_key
location_key
action_type_key
This will allow me to get a count for all the updates done, all the new ones created for a period and specify a specific region through the location dimension.
Now, in addition, I also have 2 counts for each fact, i.e. Number of People and Number of Buildings. There is no relation between these. I would like to query how many of the facts have a specific count, such as how many have 10 buildings, how many have 9, etc.
What would be the best table design for these? Basically I see the following options, but am open to hearing better solutions.
- add the counts as reference info in the fact table as people_count and building_count
- add a dimension for each of these that stores the valid options, i.e. a people dimension that stores a key and a count and a building dimension that stores a key and a count. The main fact will have a people_key and a building_key
- add one dimension for the count that is used for both people and building counts, i.e. a count dimension that stores a key and a generic count. The main fact will have a people_count_key and a building_count_key

First, your counts are essentially "dimensions" in the purest sense (you can think of dimensions as a way to group records for reporting purposes). The question, though, is whether dimensional modeling is what you want to do here. I think you are better off treating these as implicit dimensions than adding dimension tables. Essentially, dimension tables add nothing here and create corner cases for errors; I just don't think they are very helpful unless you need to track a bunch of information related to the numbers.
If it were me I would just add the counts to the fact table, not to other tables.
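A quick sketch of what that looks like, using Python's sqlite3 module with the question's column names plus people_count and building_count; the table name my_fact and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_fact (
    date_key        INTEGER NOT NULL,
    location_key    INTEGER NOT NULL,
    action_type_key INTEGER NOT NULL,
    people_count    INTEGER NOT NULL,   -- stored directly on the fact row
    building_count  INTEGER NOT NULL
);
INSERT INTO my_fact VALUES
    (20240101, 1, 1, 4, 10),
    (20240101, 2, 1, 7, 10),
    (20240102, 1, 2, 4, 9);
""")

# "How many facts have 10 buildings, how many have 9, ..." is a plain GROUP BY.
facts_per_building_count = conn.execute("""
SELECT building_count, COUNT(*)
FROM my_fact
GROUP BY building_count
ORDER BY building_count
""").fetchall()
```

The count column acts as its own grouping key, so no lookup table is needed for this kind of query.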
