Star schema for target vs. actual comparison (Kimball / SSAS)

I am going to model one of the star schemas for a university data warehousing project. We need to compare the actual application count with a target.
There are target counts (set by the colleges every year) associated with Departments, Course groups, and Courses.
The requirement is to verify that the targets are allocated correctly and to track the progress of applications against those targets.
One proposal is to include all the actual counts (department-level, course-group-level, and course-level total accepted counts) and the corresponding target counts (department-level, course-group-level, and course-level targets) in a single fact table. One of the dimensions in this star schema is the Course dimension, which holds all the
course, course group, and department information. I do understand there is a granularity problem here, but it could be handled at the cube implementation level.
Or if I want to set the target at different hierarchy levels of a dimension, should I build different fact tables? As mentioned below:
Implement three fact tables for the three types of targets and connect them to the actual fact table. In this design, the Course dimension snowflakes into a Course Group dimension and a Department dimension. The first target fact connects to Course, the second to Course Group, and the third to Department. The actual fact table is at course-level grain, so all three target fact tables can be related to it via Course, and its counts can be aggregated up to course-group and department level for comparison against each target.
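To illustrate the second proposal, here is a minimal sketch of rolling course-grain actuals up to the grains where targets live. All table, column, and member names (C1, G1, D1, accepted_count, etc.) are illustrative, not from the original schema:

```python
from collections import defaultdict

# Course dimension rows: course -> (course_group, department)
dim_course = {
    "C1": ("G1", "D1"),
    "C2": ("G1", "D1"),
    "C3": ("G2", "D1"),
}

# Actual fact at course grain: course -> accepted application count
fact_actual = {"C1": 40, "C2": 25, "C3": 30}

# Target facts at three different grains
target_course = {"C1": 50, "C2": 30, "C3": 35}
target_group = {"G1": 85, "G2": 40}
target_dept = {"D1": 130}

# Roll the course-grain actuals up to course-group and department grain
actual_group = defaultdict(int)
actual_dept = defaultdict(int)
for course, accepted in fact_actual.items():
    group, dept = dim_course[course]
    actual_group[group] += accepted
    actual_dept[dept] += accepted

# Progress against target at each grain
progress_group = {g: actual_group[g] / target_group[g] for g in target_group}
progress_dept = {d: actual_dept[d] / target_dept[d] for d in target_dept}
```

This is the aggregation the cube would perform automatically when the actual measure group is related to the snowflaked Course dimension.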
Data Architects please comment!

Related

How to store static (common data) in databases

Imagine I want to record the marks scored by participants in an exam, where the total marks are the same for every record. I only want to store the obtained marks; the common data (static, in C# terms) should not be stored redundantly against each record, since it is the same for all of them. I understand this is not strictly redundancy by the definition of normalization, as it is legitimate data for each record, but there should be a smarter way to store such static/common/metadata information.
I know one argument will be to put such business logic either in the middle tier or, better, in the logical schema of the database (via views). But we want all data to live in tables, with the logical layer derived from the data in those tables rather than holding data of its own.
Can anyone suggest something better?
You need a table for the exam, which would include the maximum marks value and the name of the exam.
And a separate table for the exam_results. This is likely a many-to-many relationship between exam and student.
To push the point, you might also want a venues table, which might have a maximum number of students who can take an exam, then an exam_sessions table as a many to many relationship between exam and venue.
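A minimal sketch of the suggested design, using SQLite via Python. The table and column names (exam, exam_result, max_marks, etc.) are illustrative; the point is that the maximum marks live once on the exam row, and each result row stores only the obtained marks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exam (
    exam_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    max_marks INTEGER NOT NULL        -- stored once per exam, not per result
);
CREATE TABLE exam_result (
    exam_id        INTEGER REFERENCES exam(exam_id),
    student_id     INTEGER,
    obtained_marks INTEGER NOT NULL,  -- only the per-record data
    PRIMARY KEY (exam_id, student_id)
);
""")
conn.execute("INSERT INTO exam VALUES (1, 'Algebra Midterm', 100)")
conn.executemany("INSERT INTO exam_result VALUES (?, ?, ?)",
                 [(1, 101, 85), (1, 102, 92)])

# Percentages come from a join, not from a repeated max_marks column
rows = conn.execute("""
    SELECT r.student_id, 100.0 * r.obtained_marks / e.max_marks
    FROM exam_result r JOIN exam e ON e.exam_id = r.exam_id
    ORDER BY r.student_id
""").fetchall()
```

Everything still lives in tables, and derived values like percentages are computed at query time.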

Filtering Tabular Model Dimension Across Multiple Facts

I am building a Tabular model in an educational environment, and am running into some difficulty around filtering the lookup table across multiple base tables.
I've read about bridge tables, which I believe I could use, but I have over a dozen fact tables, so the amount of corresponding bridge tables seems cumbersome to manage. Could I handle the filtering via DAX instead? I've been working to learn more DAX, but am new and stumped so far.
Here is a sample scenario of what I'm trying to accomplish: to simplify things, I have a dimension called Student which contains a row for each student. I have a fact called Discipline which contains discipline incident records for each student (1 per incident, and not all students will have a discipline incident). Also, I have a fact table called Assessment which contains assessment records for a test (1 per assessment completed by a student). Some students won't take this assessment, so they will have no corresponding scores.
When I model this data, in a Pivot Table, for example, to analyze a correlation between discipline and assessments, I bring in a measure called Discipline Incident Count (count of discipline incident records) and Assessment Average Score (average of assessment scores). I'm wanting to only view a list of students that have values for both, but I get something like the following:
Student Name --------Discipline Incident Count--------Assessment Average Score
Student A-------------------(blank)------------------------------85.7
Student B----------------------3-------------------------------(blank)
Student C----------------------2---------------------------------88.7
In this case, I would want my result set to include only Student C, since they have a value for both. I have also tried handling the filtering at the application layer (Excel) by filtering out blanks in each column, but with real data, which may have nested values and a large row count, that doesn't seem to work well.
Thanks for any feedback!
So, filtering this in the application layer is probably your best bet.
Otherwise, you'll have to write your measures such that each one checks the values of all other measures to determine whether to display.
An example for the sample you've shared is as follows:
ConditionalDiscipline:=
SWITCH(
TRUE()
,ISBLANK( [Assessment Average Score] )
,BLANK()
,[Discipline Incident Count]
)
SWITCH() is just syntactic sugar for nested IF()s.
Now, this measure would only work when your pivot table consists of [Discipline Incident Count] and [Assessment Average Score]. If you add another measure from a new fact table, the presence of a value for the student on the pivot table row will have no effect on the display of that row.
You'd need a new measure:
ConditionalDiscipline - version 2:=
SWITCH(
TRUE()
,ISBLANK( [Assessment Average Score] )
,BLANK()
,ISBLANK( [Your new measure] )
,BLANK()
,[Discipline Incident Count]
)
Now this version 2 will only work if both [Assessment Average Score] and [Your new measure] are non-blank. If you use this version 2 in the sample you've posted without [Your new measure], then it will still return blank for the students that have no entry for [your new measure].
Thus, you'll need one version of this for each possible combination of other measures that would be used with it.
There is no way to define "the set of all other measures currently being evaluated in the same context" in DAX, which is what you really need to test against for the existence of blanks.
You could, rather than using a SWITCH(), just multiply all the other measures together in a single test in IF(), because multiplication by blank yields blank. This does nothing for the combinatorial number of measures you'll need to write. You've got a lot of options for exactly how you define your test, but they all would end up needing the same absurd number of separate measure definitions.
I'd also like to address another point in your question. You mentioned using a bridge table. A bridge is only necessary when you implement a many-to-many relationship between dimension tables. You are not implementing this. You have several facts with a single conformed dimension. The conformed dimension is essentially a bridge between your facts, each fact existing in a many-to-many relationship with the other facts, based on the StudentKey. Any bridge table is superfluous.
The long story, shortened: no. DAX has no introspection facilities sufficient to do what you want. Handle it at the application layer.
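For reference, the application-layer filter amounts to keeping only the rows where every measure has a value. A sketch, with field names invented for illustration:

```python
# Pivot-style result rows as they might arrive at the application layer;
# None stands in for a blank measure value.
rows = [
    {"student": "Student A", "discipline_count": None, "avg_score": 85.7},
    {"student": "Student B", "discipline_count": 3,    "avg_score": None},
    {"student": "Student C", "discipline_count": 2,    "avg_score": 88.7},
]

measures = ("discipline_count", "avg_score")

# Keep only rows where all measures are non-blank
complete = [r for r in rows if all(r[m] is not None for m in measures)]
```

Unlike the DAX approach, adding a new measure here only means extending the `measures` tuple, not writing a new conditional measure per combination.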

How to choose the right intermediate measure group in a many-to-many relationship when you have multiple options

I have a fact table called "FactActivity" and a few dimension tables like users, clients, actions, date and tenants.
I create measure groups corresponding to each of them as follows:
FactActivity => Sum of ActivityCount column
DimUser => Count of rows
DimTenant => Count of rows
DimDate => Count of rows and distinct count of weekofyear column
Each user can perform multiple actions using multiple clients. A tenant is a logical grouping of users: a tenant contains multiple users, but a user can't belong to more than one tenant. All the dimension and fact tables are connected to DimDate via a regular relationship.
The cube structure is as follows.
Now I want to define the dimension relationships for each of the measure groups. Some of them are many-to-many relationships (to enable distinct count calculation). The designer shows me multiple options to choose from for many of the intersections. I'm confused as to which one to select as the intermediate measure group. Should I always pick the measure group with the fewest rows, e.g. DimDate? Or what is the right logic for determining the intermediate measure group?
This is what I got. Is this right? If not, what is wrong?
For more information to help choose the right answer:
FactActivity = 1 billion rows
DimUser = 35 million rows
DimTenant = 1 million rows
DimDate = 1000 rows
The correct way to choose the intermediate measure group depends on how you want to evaluate your measures with respect to the dimension related:
Let's start with the Activity measure group to the Tenant dimension. The question is: how should Analysis Services determine the activity count (or any other measure in the Activity measure group) for a tenant? The only reasonable way is to go from the activity fact table through the user table to the tenant table. Actually, the last relationship is not many-to-many but many-to-one, i.e. you could optimize away the tenant dimension by integrating it into the user dimension. However, a many-to-many relationship will work as well, just a little less efficiently. You might also consider a referenced relationship from user to tenant instead of a many-to-many relationship. There may be other reasons you chose to keep them as two separate dimensions, so I won't discuss this further.
Now let us continue with the next one: Tenant measure group to User dimension: The way you have configured it (using the date measure group) means that for each date that a tenant and a user have in common, the tenant count of a user adds one to the count. This is probably not what you want. I would assume you want to relate tenant measures to user dimension by the user measure group. However, I am not sure what the purpose of the DateKey in the user and tenant dimension tables is at all. Thus, your relationship may be correct.
Let's continue with the relationships from the Date measure group to the Tenant and User dimensions. I would assume there should be no relationship at all, as the week of year and the date count do not depend on tenants or users. Note that it is absolutely fine to have no relationship between some measure groups and some dimensions. If you look at the Microsoft sample cube "Adventure Works", its Dimension Usage has more gray cells (i.e. measure group and dimension unrelated) than white ones (i.e. some kind of relationship, of whatever type). With the measure group's default setting of IgnoreUnrelatedDimensions = true, the measure value will simply be the same for all members of the unrelated dimension, which should be the case for date count and week of year. Again, as I do not know the purpose of the DateKey in the user and tenant dimension tables, I am not sure this assumption holds for your data.
And after these examples, I hope you can work out the rest of the relationships yourself.

SSIS Population of Slowly Changing Dimension with outrigger

Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So, since we have an open number of areas of work and an open number of teams, we can't have Team 1, Team 2, etc. as columns in our dimension. The other attributes may also change over time, like base location (where they work out of) and the main team and area they work in.
So, following Kimball, I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthProfessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work they belong to changes, we need to create a new entry in the SCD and its outrigger tables to capture this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place the TeamKey and WorkAreaKey in dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
I'm not sure I understand your question fully. If you are unsure about change detection, then use checksums in the package. Build a temp table with the data as it is in the source, then compare each row to its counterpart (joined via the business keys) by computing the checksum for both rows and comparing them. If they differ, the data has changed.
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context) then the foreign key lookups will automatically lookup the newer entry in DimHealthProfessionals if you have a historization (i.e. have validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.
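The checksum comparison described above can be sketched as follows; the key values, attribute names, and hash choice are illustrative, not prescriptive:

```python
import hashlib

def row_checksum(row: dict) -> str:
    # Concatenate attribute values in a fixed column order so the
    # hash is stable across runs
    payload = "|".join(str(row[k]) for k in sorted(row))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Current warehouse rows and incoming source rows, keyed by business key
warehouse = {
    "HP001": {"name": "A. Nurse",  "main_team": "Team 1"},
    "HP002": {"name": "B. Doctor", "main_team": "Team 3"},
}
source = {
    "HP001": {"name": "A. Nurse",  "main_team": "Team 2"},  # team changed
    "HP002": {"name": "B. Doctor", "main_team": "Team 3"},  # unchanged
}

# Rows whose checksum differs from their warehouse counterpart have changed
changed = [key for key in source
           if key in warehouse
           and row_checksum(source[key]) != row_checksum(warehouse[key])]
```

In SSIS the same comparison is typically done with a checksum column populated in the staging table and compared in a lookup or conditional split.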

Two or more similar counts on fact table in dimensional modelling

I have designed a fact table that stores facts for a specific date dimension and an action type such as create, update or cancel. A fact can be created and cancelled only once, but updated many times.
myfact
---------------
date_key
location_key
action_type_key
This will allow me to get a count for all the updates done, all the new ones created for a period and specify a specific region through the location dimension.
Now, in addition, I also have two counts for each fact: Number of People and Number of Buildings. There is no relation between them. I would like to query how many facts have a specific count, such as how many have 10 buildings, how many have 9, etc.
What would be the best table design for this? Basically I see the following options, but am open to better solutions.
add the counts as reference info in the fact table as people_count and building_count
add a dimension for each of these that stores the valid options, i.e. people dimension that stores a key and a count and building dimension that stores a key and a count. The main fact will have a people_key and a building_key
add one dimension for the count these is used for both people and building counts, i.e. count dimension that stores a key and a generic count. The main fact will have a people_count_key and a building_count_key
First, your counts are essentially "dimensions" in the purest sense (you can think of dimensions as a way to group records for reporting purposes). The question, though, is whether dimensional modeling is what you want to do here. I think you are better off treating this as an implicit dimension than adding dimension tables. Essentially, the dimension tables add nothing and create corner cases for errors; I just don't think they are helpful unless you need to track a bunch of information related to the numbers.
If it were me I would just add the counts to the fact table, not to other tables.
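With the counts stored directly on the fact row (option 1), "how many facts have 10 buildings" is a simple group-and-count. A sketch, with illustrative key values:

```python
from collections import Counter

# Fact rows with the counts carried as degenerate/implicit-dimension columns
fact_rows = [
    {"date_key": 1, "location_key": 10, "action_type_key": 1,
     "people_count": 4, "building_count": 10},
    {"date_key": 1, "location_key": 11, "action_type_key": 1,
     "people_count": 2, "building_count": 9},
    {"date_key": 2, "location_key": 10, "action_type_key": 2,
     "people_count": 4, "building_count": 10},
]

# Number of facts per building count, e.g. "how many have 10 buildings?"
facts_per_building_count = Counter(r["building_count"] for r in fact_rows)
```

The equivalent SQL is just a GROUP BY on building_count, with no extra dimension table to join.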