Filtering Tabular Model Dimension Across Multiple Facts - sql

I am building a Tabular model in an educational environment, and am running into some difficulty around filtering the lookup table across multiple base tables.
I've read about bridge tables, which I believe I could use, but I have over a dozen fact tables, so the amount of corresponding bridge tables seems cumbersome to manage. Could I handle the filtering via DAX instead? I've been working to learn more DAX, but am new and stumped so far.
Here is a sample scenario of what I'm trying to accomplish: to simplify things, I have a dimension called Student which contains a row for each student. I have a fact called Discipline which contains discipline incident records for each student (1 per incident, and not all students will have a discipline incident). Also, I have a fact table called Assessment which contains assessment records for a test (1 per assessment completed by a student). Some students won't take this assessment, so they will have no corresponding scores.
When I model this data, in a Pivot Table, for example, to analyze a correlation between discipline and assessments, I bring in a measure called Discipline Incident Count (count of discipline incident records) and Assessment Average Score (average of assessment scores). I'm wanting to only view a list of students that have values for both, but I get something like the following:
Student Name --------Discipline Incident Count--------Assessment Average Score
Student A-------------------(blank)------------------------------85.7
Student B----------------------3-------------------------------(blank)
Student C----------------------2---------------------------------88.7
In this case, I would want my result set to only include student C, since they have a value for both. I have also tested handling the filtering on the application layer (Excel), by filtering out blanks in each column, but with the real data, which might have nested values and a large amount of data, doesn't seem to be working well.
Thanks for any feedback!

So, filtering this in the application layer is probably your best bet.
Otherwise, you'll have to write your measures such that each one checks the values of all other measures to determine whether to display.
An example for the sample you've shares is as follows:
ConditionalDiscipline:=
SWITCH(
TRUE()
,ISBLANK( [Assessment Average Score] )
,BLANK()
,[Discipline Incident Count]
)
SWITCH() is just syntactic sugar for nested IF()s.
Now, this measure would only work when your pivot table consists of [Discipline Incident Count] and [Assessment Average Score]. If you add another measure from a new fact table, the presence of a value for the student on the pivot table row will have no effect on the display of that row.
You'd need a new measure:
ConditionalDiscipline - version 2:=
SWITCH(
TRUE()
,ISBLANK( [Assessment Average Score] )
,BLANK()
,ISBLANK( [Your new measure] )
,BLANK()
,[Discipline Incident Count]
)
Now this version 2 will only work if both [Assessment Average Score] and [Your new measure] are non-blank. If you use this version 2 in the sample you've posted without [Your new measure], then it will still return blank for the students that have no entry for [your new measure].
Thus, you'll need one version of this for each possible combination of other measures that would be used with it.
There is no way to define "the set of all other measures currently being evaluated in the same context" in DAX, which is what you really need to test against for the existence of blanks.
You could, rather than using a SWITCH(), just multiply all the other measures together in a single test in IF(), because multiplication by blank yields blank. This does nothing for the combinatorial number of measures you'll need to write. You've got a lot of options for exactly how you define your test, but they all would end up needing the same absurd number of separate measure definitions.
I'd also like to address another point in your question. You mentioned using a bridge table. A bridge is only necessary when you implement a many-to-many relationship between dimension tables. You are not implementing this. You have several facts with a single conformed dimension. The conformed dimension is essentially a bridge between your facts, each fact existing in a many-to-many relationship with the other facts, based on the StudentKey. Any bridge table is superfluous.
The long story, shortened, is no. There are not introspective facilities in DAX sufficient to do what you want. Handle it at the application layer.

Related

Put hierarchy in its own separate dimension or set it as a part of related dimension?

I'm new to the dimensional model in dataware house design, and I face some confusion in my first design.
I take a simple business process (Vacation Request), and I want to ask which of these two designs is accurate and near to be effective and applicable, I will be grateful if I get a detailed answer please? (My question about the dimension design mainly)
1-
Dimension.Employee Fact.Vacation
[Employee Key] [Employee Key] FK PK
[_source key] [Vacation Transaction]PK DD
[Employee Name] ...
.... ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
....
2-
Dimension.Employee Dimension.Section Fact.Vacation
[Employee Key] [Section Key] [Employee Key] FK PK
[_source key] [_source key] [Vacation Transaction]PK DD
[Employee Name] [Department Code] [Section Key]FK
.... [Department Name] ...
.... [Campus Code]
[Campus Name]
Where the hierarchy is like this:
Campus Contains -->
Many Departments and each department contains -->
many sections and each section contains many employees
Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...
WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).
The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.
So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.
FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.
CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.
DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up having to constantly be modifying the employee record in your employee dimension - this leads to a longer than needed dimension table if this is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements). Point for putting them into separate dimensions.
DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In this case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which your fact table can relate. In this case you'll need a section table. Point for putting them into separate dimensions.
WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.
Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!
The second.
Employees are a WHO, departments are a WHERE.

Designing a cube

I’ve been asked create our analysis cube and have a design question.
We sell ‘widgets’ and ‘parts’ to go with those widgets. Each order has many widgets and sometimes a few parts.
What I’m stuck on is – to me, an order is a fact in a measure. But, what are the widgets? Are they a dimension and each fact in the measure will be an entry for every part and widget for the order.
So, if order 123 had widget 1 and widget 2 and part 5, then there will be 3 facts in the measure for the same order? Is that correct?
At its basic level you can consider most facts to be transactions or transaction line items. So, for example, you may have a 'sales' fact table in which each record represents one line item from that sale. Each fact record would have numeric columns representing metrics and other columns joining to dimension tables. The combination of those dimensions would describe that line item. So, in your case, you likely have something like:
1) A 'date' dimension detailing the date of the transaction
2) A 'widget' dimension detailing the widget sold on that transaction
3) A 'customer' dimension detailing the customer who bought that item (almost certainly the same customer would appear on every line item for this transaction)
4) ... determined by what information you have and what business problem you're trying to solve.
Now, the dimension tables contain further details. For example, your widget dimension table likely contains things like the name of the widget, the color, the manufacturer, etc. Every time your company sells one of these widgets, the record in the fact table links to that same dimension record for that name, color, manufacturer, etc. combination (i.e. you don't create a new dimension record every time you sell the same item - this is a one-to-many relationship - each dimension record may have many related fact records).
You other dimension tables would similarly describe their dimensions. For example, the customer dimension might give the customer's name, their address, ...
So, the short answer to your question is that widget likely is a dimension, items and widgets may (or may not) actually be the same dimension (in a school class I suspect that they are), and that you would have 3 fact records for that one transaction.
This is probably along the same lines as the prior answer but....
If you try and model "many widgets per order" you'll have issues because you end up with a many (order fact) to many (widgets) relationship. In a cube / star schema design, many to many relationship usually need to be moddeled out to be many to one in some way.
So what you do is try and identify what special thing identifies an "order" (as opposed to a bunch of widgets in an order). Usually that is simply stuff like order date, customer, order number, tax
An example way to model this is:
If you have a single order with five widgets, you model that as a fact table with five records that happens to have a repeating widget, customer, date etc. in it
Then you have to work out how you spread an order header tax amount over five records. The two obvious solutions are:
Create a widget that represents tax and add that as another record
Spread the tax over five records, either evenly or weighted by something
Modelling "parts" just takes these concepts further.
It is important to understand what the end user wants to see, why they want to see parts. What do they want to measure by parts, how do you assign higher level values (like tax) down to lower levels like parts.

Star schema for target and actual comparison Kimball

I am going to model one of the star schemas for a university data warehousing project. We need to compare the actual application count with a target.
There are target counts (set by the colleges every year) associated with Departments, Course groups, and Courses.
The requirement is to ensure that the targets set get correctly allocated and also the progress of applications against the target.
One proposal is to include all the actual counts (department level total accepted count, course group level total accepted count, course level total accepted count) and corresponding target counts (dept level target counts, course group target counts, course target counts) in single fact table. One of the dimensions in this star schema is Course dimensions. It consists of all the
course, course group and department information. I do understand a granularity problem here, but this could be handled at the cube implementation level.
Or if I want to set the target at different hierarchy levels of a dimension, should I build different fact tables? As mentioned below:
Implement 3 fact tables for 3 different types of targets and connect these fact tables to the actual fact table. In this situation, the Course dimension can snowflake into course group dimension and department dimensions. First fact is connected to course. Second fact to course group and the 3rd one to dept. The actual fact table is of granularity of course level so all three fact tables can be connected to the actual fact table via course. Note that the actual fact table is of grain course-level and this can be aggregated to get higher level such as course group and dept actual counts.
Data Architects please comment!

database: summarizing data which expires

I'm struggling to find an efficient and flexible representation for my data. We have a many-to-many relationship between two entities which have arbitrary lifetimes. Let's call these Voter and Candidate. Each relationship has a measurement which we'd like to summarize in various ways. These are timestamped and are guaranteed to be within the lifetime of the two related entities. Let's say the measure is approval rating, or just Rating.
One unusual requirement is that if I'm summarizing a period which has no measurement, I should substitute the latest valid measurement, rather than giving NULL.
Our current solution is to compile a list of valid voters and candidates for each day, then formulate a many-to-many table which records the latest valid measure.
What would your solution be?
This allows me to do a single query to get a daily summary:
select
avg(rating), valid_date, candidate_SSN, candidate_DOB
from
daily_rating natural join rating
group by
valid_date, candidate_SSN, candidate_DOB
This might work ok, but It seems inefficient to me. We're repeating a lot of data, especially if nothing happens for a given day. It also is unclear how to do weekly/monthly summaries without compiling even more tables. Since we're dealing with millions of rows (we're not really talking about voter polls...) I'm looking for a more efficient solution.
I have used data-warehousing technique here, hence the dim and fact table names.
dimDate is so-called date dimension, one row per a date.
dimCandidate has all candidate data, new and old records. In data-warehousing terms this is called type 2 dimension. One candidate can have several rows in this table, only one of them having r_status = 'current'.
Fields
, r_valid_from date
, r_valid_to date
, r_version integer -- (1, 2, 3,..)
, r_status varchar(10) -- (expired, current)
describe a record (row) status. Each time a candidate status changes, a new row is inserted and the pervious row's r_valid_to and r_status are modified.
CandidateFullName is a business (natural) key and has to uniquely identify a candidate. No two candidates can have the same CandidateFullName. Note that the CandidateKey uniquely identifies a row in the table, while CandidateFullName uniquely identifies a candidate.
dimVoter has voter data, new and old records -- just like the dimCandidate.
dimCampaign describes campaign details, this is so-called type one dimension, does not hold historical data.
factRating has the Rating measure.
Normaly this would be enough, but there is the reqirement to interpolate the missing data for a day; for that, an aggregate table aggDailyRating is introduced. At the end of a day, a scheduled job aggregates ratings for the day. This job takes care of the data-interpolation requirement.
This way the aggregate table has one row for each date-(valid) candidate-campaign combination. Note that voter is not included in the combination, data is aggregated over all voters.
Any reporting is done on the aggregate table, for example
--
-- monthy rating for years 2009-2010
-- for candidate john_smith_256
--
select
CalendarYear
, MonthNumber
, avg(DailyRating) as AverageRating
from aggDailyRating as f
join dimDate as d on d.DateKey = f.DateKey
join dimCandidate as c on c.CandidateKey = f.CandidateKey
where CandidateFullName = 'john_smith_256'
and CalendarYear between 2009 and 2010
group by CalendarYear, MonthNumber
order by CalendarYear desc, MonthNumber desc ;
Yes, that is very inefficient and wasteful. It is merely a set of files, not reasonably comparable to a set of "tables" or a "database"; extensions and enhancements to it will compound the duplication and inefficiency. Duplication is the antithesis of a database. In database terms, there are far more efficient and easier ways to implement that.
Assumption
Your post does not provide much info, so I have had to make some assumptions, but I think you can correct my submission quite easily if any of them are incorrect. Otherwise comment, and I will correct my submission.
A Voter is a Person; a Candidate is a Voter; (Candidate = subset of Voter)
A Campaign is related to Candidate (not to a Polling Campaign).
A Poll is a survey of the Voters response to a Candidate's performance, staring on a set date, running over a few days, and completing on an set date.
There are many Measures, such as ApprovalRating, that are surveyed in each Poll.
The Measures of such surveys across all Voters are aggregated at the Poll level.
Limitation
The expiry requirement is unclear, so I am not suggesting I have implemented that. If the model does not provide that for you (if it is not immediately obvious), supply details and I will add to the model. The current model provides exclusion/inclusion capability for what I understand the expiry requirement to be.
The Poll::Measure does not have enough info to be implemented fully; I need further details. The submission is primitive and unconstrained in that area.
Likewise, any Poll::Campaign relation or constraint ("there are many Polls per Campaign, and they are always related to Campaign") has not been implemented.
The arrangement of the key in the child tables is arbitrary for now: if you identify the most common queries, it can be re-arranged, so that the most those obtain the best speed.
Submission
Campaign Poll Data Model
This is just a Relational (Normalised; zero duplication) Database, pure IDEF1X, including provision for the consideration that the child tables will be huge: migration of narrow surrogate keys into the child tables, avoiding migration of wide keys.
It provides "data warehouse" capability as is. In fact, if it does not provide any BI or DSS requirement in a single query, that is only due to lack of detail from you; please provide, and I will happily change it. (Note, your item re "single query" is actually "single file"; joins are pedestrian in a Relational database.)
Keys such as %Code are 2-, 3-, and at most 4-characters. Such keys are just as fast as Integer keys, and very helpful (makes sense) when perusing the tables (without having to join the parent).
Any and all aggregation, either to load the historic rows, or to produce aggregates for the current values, should be possible in a single Relational (set-oriented) command; you should not need to resort to serial (cursor) processing. Again, if you think you need to, please comment and I will provide the set-oriented method.
We implement Versioning in DBs quite differently to the way it is done in DWs, and without limitations. Please identify if you require versioning of (eg) Candidate, and I will provide.
Last, the Null requirement is not unusual. It is catered for here. Again, if you think it isn't ...

Two or more similar counts on fact table in dimensional modelling

I have designed a fact table that stores the facts for a specific date dimension and an action type such as create, update or cancelled. The facts can be create and cancelled only once, but update many times.
myfact
---------------
date_key
location_key
action_type_key
This will allow me to get a count for all the updates done, all the new ones created for a period and specify a specific region through the location dimension.
Now in addition I also have 2 counts for each fact, i.e. Number of People, Number of Buildings. There is no relation between these. And I would like to query on how many of the facts having a specific count, such as how many have 10 building, how many have 9 etc.
What would be the best table design for these. Basically I see the following options, but am open to hear better solutions.
add the counts as reference info in the fact table as people_count and building_count
add a dimension for each of these that stores the valid options, i.e. people dimension that stores a key and a count and building dimension that stores a key and a count. The main fact will have a people_key and a building_key
add one dimension for the count these is used for both people and building counts, i.e. count dimension that stores a key and a generic count. The main fact will have a people_count_key and a building_count_key
First your counts are essentially "dimensions" in the purest sense (you can think of dimensions as a way to group records for reporting purposes). The question though is whether dimensional modeling is what you want to do. I think you are better off as seeing this as something of an implicit dimension than you are to add dimension tables. What this means essentially is that dimension tables add nothing and they create corner cases of errors I just don't think are very helpful unless you need to track a bunch of information related to numbers.
If it were me I would just add the counts to the fact table, not to other tables.