I am developing an ssas database and have snowflaked dimensions to which it has links. For example I have a customer dimension table, distributor dimension table and a territory dimension table in which there is a relationship to the latter from the other two. Therefore I can illustrate the relationships as follows:
Retailer <-- Territory
Distributor <-- Territory
In a specific cube in the database, I have measures where all the three dimensions mentioned above have relationships to. As far as the measures are considered browsing across individual dimensions happen smoothly.
But the problem comes when I try to browse a related measure from two dimensions at the same time; eg: territory and distributor
All the distributors are shown under a given territory.
When I add the territory key attribute to the distributor dimension and that specific attribute is used from the distributor dimension it self the relationship is shown correctly. But when I try to go from the territory dimension in the cube this relationship does not get exposed as explained earlier.
Any help is deeply appreciated.
This may not answer your question directly, but if you have several dimensions that are closely related and often used together, you could consolidate them into a "mini-dimension" that has every possible combination of territory, distributor and retailer (see my answer to another question):
create table dbo.DIM_TerritorySalesChannels (
TerritorySalesChannelID int not null primary key,
TerritoryName nvarchar(100) not null,
RetailerName nvarchar(100) not null,
DistributorName nvarchar(100) not null,
/* other attributes */
)
This might initially seem awkward, but it's actually quite easy to populate and manage and it avoids the complexity of relationships between dimensions, which often gets messy (as you've discovered). Obviously you end up with one very large dimension instead of three smaller ones, but as I mentioned in the other answer, we have several hundred thousand rows in one dimension and it's never been an issue for us.
Related
I'm new to the dimensional model in dataware house design, and I face some confusion in my first design.
I take a simple business process (Vacation Request), and I want to ask which of these two designs is accurate and near to be effective and applicable, I will be grateful if I get a detailed answer please? (My question about the dimension design mainly)
1-
Dimension.Employee Fact.Vacation
[Employee Key] [Employee Key] FK PK
[_source key] [Vacation Transaction]PK DD
[Employee Name] ...
.... ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
....
2-
Dimension.Employee Dimension.Section Fact.Vacation
[Employee Key] [Section Key] [Employee Key] FK PK
[_source key] [_source key] [Vacation Transaction]PK DD
[Employee Name] [Department Code] [Section Key]FK
.... [Department Name] ...
.... [Campus Code]
[Campus Name]
Where the hierarchy is like this:
Campus Contains -->
Many Departments and each department contains -->
many sections and each section contains many employees
Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...
WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).
The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.
So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.
FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.
CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.
DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up having to constantly be modifying the employee record in your employee dimension - this leads to a longer than needed dimension table if this is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements). Point for putting them into separate dimensions.
DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In this case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which your fact table can relate. In this case you'll need a section table. Point for putting them into separate dimensions.
WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.
Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!
The second.
Employees are a WHO, departments are a WHERE.
I am developing a website for a dealer of wine and other alcoholic beverages. Obviously, each wine is made in a country that must be modeled in the Wine table.
But many times, a Wine also has a Region (Languedoc, Rioja, Bougogne etc.), these regions are of course in a parent-child relationship with a country.
THe following options exist:
-Giving the Wine table only a reference to a region
Problem is that some wines/whiskeys do not mention a region, only a country
-Giving the wine table 2 separate FK references, to a Country and a Region table. This introduces a circular reference and a redundacny problem becase country and region are already related.
-Using a Location table and a single FK referernce from the Wine table to the Location table. THe Location table is in fact a region or a country (maybe even a city) so it has a field "location_type" and a parent FK field, referring to its own PK. For the top-level Country entries, the parent id is null.
This is the example I have found somewhere in the internet. It will make however the queries more complex.
Is this a known problem, and are there any suggestions?
TIA, Klaas
I'm also working on an application in this domain. Common for wine, there is also the concept of sub-region or appellation, so you can have wines from France-Bourgogne-Cote d'Or, for example. I went with the second option you described, having FK references to Country, Region, and Subregion. Only the Country field is required, while the others are nullable. The potential issues with referential integrity are compounded with this model, but it greatly facilitates effective query based on these fields, which is kind of the point of capturing this information in the first place.
I think you might want to look at this from a dimensional analysis perspective, rather than a strict entity-relationship one. That is, a but of denormalization may be just what you are looking for. I would recommend Ralph Kimball's books on dimensional data warehouses, since they often solve this type of problem.
In your case, just create a "location" dimension that contains all the fields that you might be interested in, at the lowest level of granularity:
Region
Country
SubCountry
Are the two obvious ones. You might also have hillside, city, whatever.
To take an example, you would have the following rows:
Barolo/Italy/Piemonte
NULL/Italy/Piemonte
NULL/Italy/NULL
The wines would be connected to this table.
Now, you have a maintenance problem for this table. However, the universe of wines and official regions is well known and very slowly changing, so I don't see this as a problem.
Good point about creating a location dimension. This type of model addresses analysis more effectively, but is more complicated for transactional systems. This gets into the question of whether you're optimizing your model to handle CRUD-type transactions, or for aggregated data analysis.
On the whole, I assume that Klaus is looking at modeling for a transactional system with basic query, rather than an analysis-based application like a data warehouse.
I was hoping someone could explain the appropriate use of the 'FACT Relationship Type' under the Dimension Usage tab. Is it simply to create a dimension out of your fact table to access attribute on the fact table itself?
Thanks in advance!
Yes, if your fact table has attributes that you would like to slice by (create a dimension from), you would use this relationship type.
Functionally, to the users it behaves no differently than a regular relationship.
After you create your dimensions and cubes you need to define how each dimension is related to each measure group. A measure group is a set of measures exposed by a single fact table.
Each cube can contain multiple fact tables and multiple dimensions. However, not every dimension will be related to every fact table.
To define relationships right click the cube in BIDS and choose open; then navigate to the Dimension Usage tab. If you click the ellipsis button next to each dimension you will see a screen that allows you to change dimension usage for a particular measure group. You can choose from the following options:
Regular default option; the dimension is joined directly to the fact table
No relationship the dimension is not related to the current measure group
Fact the dimension and fact are derived from a single table. If this is the case your dimensional warehouse has poor design and isn't likely to perform well. Consider separating fact and dimension tables.
Referenced the dimension is joined to an intermediate table prior to being joined to the fact table. Referenced relationship resembles a snowflake dimension, but is slightly different. Suppose you have a customer dimension and a sales fact; you'd like to examine total sales by customer, but you also want to examine line item sales by customer. Instead of duplicating the customer key in the line item fact table you can treat the sales fact as an intermediate table to join customer to line item.
Many-to-many this option involves two fact tables and two dimension tables. Dimension A is joined to an intermediate fact A, which in turn joins to dimension B to which the fact B is joined. Much like with fact option if you need to use many-to-many option your design could probably use some improvement. This type of relationship is sometimes necessary if you are building cubes on top of a relational database that is in 3rd normal form. It is strongly advisable to use a dimensional model with star schema for all cubes. For example you could have two fact tables: vehicles and options; each vehicle can come with a number of options. You're likely to examine vehicle sales by customer, and options by the items that are included in each option. Therefore you would have a customer dimension and item dimension. You could also want to examine vehicles sales by included item. If so the vehicle fact would be joined to the options fact and customer dimension; the options fact would also join to items' dimension.
Data mining target dimension is based on a mining model which is built from a source dimension. Both source dimension and target dimension must be included in the cube.
Working on a data warehouse, a suitable analogy for the problem is that we have Healthcare Practitioners. Healthcare Practitioners have a number of professional attributes and work in an open number of teams and in an open number of clinical areas.
For example, you may have a nurse who works in children's services across a number of teams as a relief/contractor/bank staff person. Or you may have a newly qualified doctor who works general medicine who is doing time in a special area pending qualifying as a consultant of that special area.
So we have an open number of areas of work and an open number of teams, we can't have team 1, team 2 etc in our dimensions. The other attributes may change over time also, like base location (where they work out of), the main team and area they work in..
So, following Kimble I've gone for outriggers:
Table DimHealthProfessionals:
Key (primary key, identity)
Name
Main Team
Main Area of Work
Base Location
Other Attribute 1
Other Attribute 2
Start Date
End Date
Table OutriggerHealthProfessionalTeam:
HPKey (foreign key to DimHealthPRofessionals.Key)
Team Name
Team Type
Other Team Attribute 1
Other Team Attribute 2
Table OutriggerHealthProfessionalAreaOfWork:
HPKey (as above)
Area of Work
Other AoW attribute 1
If any attribute of the HP changes, or the combination of teams or areas of work in which they work change, we need to create a new entry in the SCD and it's outrigger tables to encapsulate this.
And we're doing this in SSIS.
The source data is basically an HP table with the main attributes, a table of areas of work, a table of teams and a pair of mapping tables to map a current set of areas of work to an HP.
I have three data sources, one brings in the HCP information, one the areas of work of all HCPs and one the team memberships.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
Admittedly I may not understand everything here, but it seems to me that the relationship in this example should be reversed. Place TeamKey and the WorkAreaKey in the dimHealthProfessionals -- this should simplify things.
With this in place, you simply make sure to deliver outriggers before the dimHealthProfessionals.
Treat outriggers as dimensions in their own right. You may want to treat dimHealthProfessionals as a type 2 dimension, to properly capture the history.
EDIT
Considering that team to person is many-to-many, a fact is more appropriate.
A column in a dimension table is appropriate only if a person can belong to only one team at a time. Same with work areas.
The problem is how to run over all three datasets to determine if an HP has changed an attribute, and if they have changed an attribute, how we update the DIM and two outriggers appropriately.
Can anyone point me at a best practice for this? OR suggest an alternative way of modelling this dimension?
I'm not sure I understand your question fully. If you are unsure about change detection, then use Checksums in the package. Build up a temp table with the data as it is in the source, then compare each row to its counterpart (joined via the business keys) by computing the checksum for both rows and comparing those. If they differ, the data has changed.
If you are talking about cascading updates in a historized dimension hierarchy (and you can treat the outriggers like a hierarchy in this context) then the foreign key lookups will automatically lookup the newer entry in DimHealthProfessionals if you have a historization (i.e. have validFrom / validThrough timestamps in DimHealthProfessionals). Those different foreign keys result in a different checksum.
I have designed a relatively simple data warehouse that uses the star schema. I have a fact table with just a primary key along with CompanyID and Amount (the actual measurement) columns. Of course I also have a dimension table to represent the companies which the fact table references.
Now I'm required to create a single level hierarchy (CompanyGroup) for companies. This seems like an easy task but the catch is that a single company should be allowed to exist within multiple CompanyGroups.
I experimented with this by creating a new dimension table called CompanyHierarchy that holds a primary key, GroupKey and CompanyKey. Defining a user defined hierarchy where GroupKey is the top level and CompanyKey is the second level yields A duplicate attribute key has been found error for the CompanyKey attribute while processing the dimension.
So, I'm not quite sure how to even start with this. How can I create a user defined hierarchy within a dimension where attributes can exist multiple times?
Screen shot of my current cube definition can be seen at:
img132.imageshack.us/img132/6729/ssasm2m.gif
You need to create a many-to-many relationship (one company can belong to many groups and one group can have many companies) There is an example of a many-to-many relationship in the Adventure Works cube around the sales reason dimension and there is an extensive white paper here that explains a number of different ways of using many-to-many relationships.
There is also a technique for supporting multiple members in the one hierarchy that I documented here