I am fairly new to data warehousing and SSIS, but I have been tasked with populating a data warehouse with sales transaction records from two different divisions of the parent company. My issue: I am modifying the SSIS package that populates the Product (SKU) dimension to accommodate the products of both divisions, and I have ended up with a few product names that exist in both divisions. I need a solution that accommodates the product list for each division in the SAME dimension table. Is this possible?
To illustrate:
https://www.dropbox.com/s/hkda4n1bfs5o178/Capture.JPG?dl=0
Here 'widget_3' and 'widget_4' are named the same in both divisions, but they are NOT the same product; they just happen to share a name. I imagine this is a common problem, but I am reluctant to make any changes to the dimension table schema before consulting with someone first.
I am working with a Product dimension table that has [MemberID] as the primary key and [Product] as a unique nonclustered constraint with IGNORE_DUP_KEY = OFF. My first instinct was to modify the table schema to change IGNORE_DUP_KEY to ON and rely on a [Division] attribute to help populate the data in the fact table, i.e. use [Product] and [Division] together to identify the [MemberID] on update.
Something like this?
https://www.dropbox.com/s/fjzvsh80mtp3ozs/Capture2.JPG?dl=0
Am I going down the wrong path?
Notes:
- Using SQL 2008
At the end of the day, this is a business problem. If there is a name conflict between two departments, the conflict should be resolved before the data is presented together; otherwise a department will see sales against its product that do not belong to it.
Once it is agreed how to treat this at the global level (for example, a short department prefix in case of a clash), the problem is automatically solved.
If the departments cannot be reached or do not agree on a solution, you could have two product name columns, one for each department, and use them together as the PK (I would not include a division, or at least I would not show it, because it is confusing for end users). But I do recommend finding a business solution, not a technical one.
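If a technical fix is needed in the meantime, one option in the spirit of the [Division] attribute from the question is to make the natural key the pair ([Division], [Product]) instead of [Product] alone. The following is only a minimal sketch, using Python's sqlite3 module as a stand-in for SQL Server; the table name DimProduct, the division labels and the sample rows are assumptions. Note also that IGNORE_DUP_KEY = ON does not allow duplicate keys in a unique index: it discards the duplicate rows with a warning instead of failing the statement, so switching it on would most likely drop the second 'widget_3' rather than store it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (
    MemberID INTEGER PRIMARY KEY,   -- surrogate key, assigned automatically
    Division TEXT NOT NULL,
    Product  TEXT NOT NULL,
    UNIQUE (Division, Product)      -- natural key is the pair, not Product alone
);
""")

# Sample rows (assumed): widget_3 and widget_4 exist in both divisions.
rows = [
    ("Division A", "widget_1"), ("Division A", "widget_3"), ("Division A", "widget_4"),
    ("Division B", "widget_2"), ("Division B", "widget_3"), ("Division B", "widget_4"),
]
conn.executemany("INSERT INTO DimProduct (Division, Product) VALUES (?, ?)", rows)

def lookup_member_id(division, product):
    """Return the surrogate key for a (division, product) pair, or None."""
    row = conn.execute(
        "SELECT MemberID FROM DimProduct WHERE Division = ? AND Product = ?",
        (division, product),
    ).fetchone()
    return row[0] if row else None

id_a = lookup_member_id("Division A", "widget_3")
id_b = lookup_member_id("Division B", "widget_3")

# The same name inside one division is still rejected.
try:
    conn.execute(
        "INSERT INTO DimProduct (Division, Product) VALUES (?, ?)",
        ("Division A", "widget_3"),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With this shape the fact load looks up [MemberID] by both columns, so the two 'widget_3' products get different keys and their sales never mix.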
I'm new to dimensional modeling in data warehouse design, and I have run into some confusion in my first design.
I took a simple business process (Vacation Request), and I want to ask which of these two designs is more accurate, effective and applicable. I would be grateful for a detailed answer. (My question is mainly about the dimension design.)
1-
Dimension.Employee      Fact.Vacation
[Employee Key]          [Employee Key] FK PK
[_source key]           [Vacation Transaction] PK DD
[Employee Name]         ...
....                    ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
....
2-
Dimension.Employee      Dimension.Section       Fact.Vacation
[Employee Key]          [Section Key]           [Employee Key] FK PK
[_source key]           [_source key]           [Vacation Transaction] PK DD
[Employee Name]         [Department Code]       [Section Key] FK
....                    [Department Name]       ...
....                    [Campus Code]
                        [Campus Name]
Where the hierarchy is like this:
Campus Contains -->
Many Departments and each department contains -->
many sections and each section contains many employees
Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...
WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).
The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.
So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.
FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.
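To make the single-dimension option concrete, here is a rough sketch of design 1, written with Python's sqlite3 module purely so it can be run as-is. The VacationDays measure and the sample rows are assumptions; the other names follow the question. The point is that reporting at any level of the campus/department/section hierarchy needs only the one join from the fact table to the employee dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimEmployee (
    EmployeeKey    INTEGER PRIMARY KEY,
    EmployeeName   TEXT,
    SectionName    TEXT,   -- hierarchy attributes are flattened
    DepartmentName TEXT,   -- into the employee dimension
    CampusName     TEXT
);
CREATE TABLE FactVacation (
    VacationTransaction INTEGER PRIMARY KEY,
    EmployeeKey  INTEGER NOT NULL REFERENCES DimEmployee(EmployeeKey),
    VacationDays INTEGER NOT NULL          -- hypothetical measure
);
INSERT INTO DimEmployee VALUES
    (1, 'Ann', 'Payroll',  'Finance', 'North'),
    (2, 'Bob', 'Payroll',  'Finance', 'North'),
    (3, 'Cy',  'Helpdesk', 'IT',      'South');
INSERT INTO FactVacation VALUES (100, 1, 5), (101, 2, 3), (102, 3, 2);
""")

# One join is enough to roll up to section (or department, or campus).
days_by_section = dict(conn.execute("""
    SELECT e.SectionName, SUM(f.VacationDays)
    FROM FactVacation AS f
    JOIN DimEmployee  AS e ON e.EmployeeKey = f.EmployeeKey
    GROUP BY e.SectionName
""").fetchall())
```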
CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.
DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up having to constantly be modifying the employee record in your employee dimension - this leads to a longer than needed dimension table if this is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements). Point for putting them into separate dimensions.
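As an illustration of why frequent section changes lengthen a type two dimension, here is a small sketch (Python with sqlite3, runnable as-is). The ValidFrom, ValidTo and IsCurrent columns and the change_section helper are assumptions about how the history might be tracked, not something taken from the question. Every move expires the current row and adds a new one with a new surrogate key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimEmployee (
    EmployeeKey  INTEGER PRIMARY KEY,  -- surrogate key, one per version
    SourceKey    INTEGER NOT NULL,     -- employee id in the source system
    EmployeeName TEXT,
    SectionName  TEXT,
    ValidFrom    TEXT NOT NULL,
    ValidTo      TEXT,                 -- NULL while the row is current
    IsCurrent    INTEGER NOT NULL DEFAULT 1
);
INSERT INTO DimEmployee (SourceKey, EmployeeName, SectionName, ValidFrom)
VALUES (42, 'Ann', 'Payroll', '2020-01-01');
""")

def change_section(source_key, new_section, change_date):
    """Expire the current version of the employee and insert a new one."""
    with conn:  # commit both statements together
        key, name = conn.execute(
            "SELECT EmployeeKey, EmployeeName FROM DimEmployee "
            "WHERE SourceKey = ? AND IsCurrent = 1",
            (source_key,),
        ).fetchone()
        conn.execute(
            "UPDATE DimEmployee SET ValidTo = ?, IsCurrent = 0 WHERE EmployeeKey = ?",
            (change_date, key),
        )
        conn.execute(
            "INSERT INTO DimEmployee (SourceKey, EmployeeName, SectionName, ValidFrom) "
            "VALUES (?, ?, ?, ?)",
            (source_key, name, new_section, change_date),
        )

change_section(42, "Helpdesk", "2021-06-01")
change_section(42, "Audit", "2022-03-01")

versions = conn.execute(
    "SELECT SectionName, IsCurrent FROM DimEmployee "
    "WHERE SourceKey = 42 ORDER BY EmployeeKey"
).fetchall()
```

Two moves leave three rows for one employee. With a separate section dimension the employee row would not need a new version for this kind of change.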
DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In this case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which your fact table can relate. Point for putting them into separate dimensions.
WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.
Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!
The second.
Employees are a WHO, departments are a WHERE.
I am developing an application for making quotations. First you make a cost breakdown (or calculation), and based on that result you add an item to the quotation. The problem is that I have many products, so each product category will have its own cost breakdown form with different parameters to fill in. If I have only one table for cost breakdowns, it will be huge (a lot of fields in the table). I have a feeling that this is not the right approach. So I came up with the diagram below:
Is this solution even possible, or must I have N different FKs, one for each of the N cost breakdown tables? Do you have any better solutions?
I have another question: is my linking table "Quotation_QtnDetail" necessary?
It would be possible to store a reference to a particular value in one of these tables by having a CalculationType column indicating which table the record is in, along with a generic reference ID column (containing the ID of the relevant record). For example, if you were storing a CalcId of 123 and a CalculationType of 2, this would point to the record with ID 123 in the Calc2 table.
The downside to doing this is you're going to lose the ability to validate your data using FK constraints, and it will also make joins to your calculation tables a bit more complicated.
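A rough sketch of that pattern, written in Python with sqlite3 so it can be run directly. Calc1 and Calc2 come from the example in this answer; the QtnItm table and all column names other than CalculationType and CalcId are made up for illustration. It shows both the lookup and the downside: a row pointing at a calculation that does not exist is accepted without complaint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Calc1 (Id INTEGER PRIMARY KEY, MaterialCost REAL);
CREATE TABLE Calc2 (Id INTEGER PRIMARY KEY, LabourHours REAL, HourlyRate REAL);
CREATE TABLE QtnItm (
    Id              INTEGER PRIMARY KEY,
    CalculationType INTEGER NOT NULL,  -- 1 = Calc1, 2 = Calc2
    CalcId          INTEGER NOT NULL   -- cannot be an FK: the target table varies
);
INSERT INTO Calc2  VALUES (123, 10.0, 40.0);
INSERT INTO QtnItm VALUES (1, 2, 123);  -- points at Calc2 row 123
INSERT INTO QtnItm VALUES (2, 1, 999);  -- dangling reference, still accepted
""")

# Fixed whitelist mapping the type code to a table name.
CALC_TABLES = {1: "Calc1", 2: "Calc2"}

def load_calculation(item_id):
    """Follow the (CalculationType, CalcId) pair to the matching row, if any."""
    calc_type, calc_id = conn.execute(
        "SELECT CalculationType, CalcId FROM QtnItm WHERE Id = ?", (item_id,)
    ).fetchone()
    table = CALC_TABLES[calc_type]  # never built from user input
    return conn.execute(
        f"SELECT * FROM {table} WHERE Id = ?", (calc_id,)
    ).fetchone()

found = load_calculation(1)    # the Calc2 row
missing = load_calculation(2)  # None: nothing validated the reference
```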
Regarding the Quotation_QtnDetail table, unless a QtnDetail record could ever be linked to multiple Quotation records, there is no need for this extra linking table. Instead, just link it directly by adding a QtnId column to the QtnDetail table. Similarly, you may also be able to remove the Calc_QtnItm table if an item is only ever linked to a single calculation record.
I found a schema in Google Images (see below) that illustrates a problem I am having in my data warehouse design:
My design is different, but this is the simplest figure I could find to convey my question. Given the figure, how could the schema accommodate the following scenario: a product has a unique number assigned to it by the SalesOrg (salesOrg_product_number). For example, a salesOrg sells food items and assigns all food items of the same kind the same unique salesOrg_product_number. A different salesOrg would have a different salesOrg_product_number for that type of product.
I'm inclined to place the salesOrg_product_number attribute in the Product dimension table, but part of me thinks it should be in the salesOrg dimension table instead. Which of these is the correct way, in a data warehouse (not relational DB) design, to maintain the star schema?
In a perfect world, the primary key of a dimension table should be just a surrogate key, without any business meaning. Table IDs should be invisible to end users, but business codes should of course be available.
A possible solution would be to have a product table with a structure like:
Product_id
Product_desc
Product_SO1_number
Product_SO2_number
...
Of course, this requires showing the correct field to the correct Sales Organization. Depending on your reporting tool, this can be more or less difficult. For example, if you write your queries manually, you just need to put the right column in your SELECT.
Another possibility would be to have a product/sales_org table, which combines the Product and Sales_Org tables:
Product_Sales_Org_id
Product_id
Sales_Org_id
Product_SO_number
...
This table will be a child of the two dimension tables, and the fact table will have a Product_Sales_Org_id column. Depending on the Product and the Sales Organization, Product_SO_number will return the correct number for that SO.
If you want to have this in a star schema structure, you can put Product, Sales_Org and Product_Sales_Org together in a single table like:
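Here is a small sketch of that relationship table, in Python with sqlite3 so it runs as-is. The Fact_Sales table, its Quantity measure and the sample data are assumptions; the other names follow the column lists in this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product   (Product_id   INTEGER PRIMARY KEY, Product_desc   TEXT);
CREATE TABLE Sales_Org (Sales_Org_id INTEGER PRIMARY KEY, Sales_Org_desc TEXT);
CREATE TABLE Product_Sales_Org (
    Product_Sales_Org_id INTEGER PRIMARY KEY,
    Product_id        INTEGER NOT NULL REFERENCES Product(Product_id),
    Sales_Org_id      INTEGER NOT NULL REFERENCES Sales_Org(Sales_Org_id),
    Product_SO_number TEXT NOT NULL,      -- the number this SO gave the product
    UNIQUE (Product_id, Sales_Org_id)
);
CREATE TABLE Fact_Sales (                 -- hypothetical fact table
    Product_Sales_Org_id INTEGER NOT NULL
        REFERENCES Product_Sales_Org(Product_Sales_Org_id),
    Quantity INTEGER NOT NULL
);
INSERT INTO Product   VALUES (1, 'Rice');
INSERT INTO Sales_Org VALUES (1, 'North'), (2, 'South');
INSERT INTO Product_Sales_Org VALUES (10, 1, 1, 'N-001'), (11, 1, 2, 'S-778');
INSERT INTO Fact_Sales VALUES (10, 5), (11, 7), (10, 2);
""")

# Each sales organization sees the same product under its own number.
sales_by_so_number = conn.execute("""
    SELECT s.Sales_Org_desc, pso.Product_SO_number, SUM(f.Quantity)
    FROM Fact_Sales AS f
    JOIN Product_Sales_Org AS pso
      ON pso.Product_Sales_Org_id = f.Product_Sales_Org_id
    JOIN Sales_Org AS s ON s.Sales_Org_id = pso.Sales_Org_id
    GROUP BY s.Sales_Org_desc, pso.Product_SO_number
    ORDER BY s.Sales_Org_desc
""").fetchall()
```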
Product_Sales_Org_id
Product_id
Sales_Org_id
Product_desc
Sales_Org_desc
Product_SO_number
...
Honestly, I would go for the second solution: keep the Product and Sales_Org tables separate, because they are two different business entities, and implement the relationship table in the middle.
I hope this helps.
I have designed warehouse tables for places: DimPlaces, FactPlaces and DimGeography. It is a straightforward design. All the locations are in DimPlaces (Addrline1, Addrline2, placename, etc.) and the geography hierarchy is in DimGeography (City, State, Country, PostCode). FactPlaces is a table with foreign keys to DimPlaces and DimGeography.
I would like to maintain historical data, since place names or their properties might change, and if the location of a place changes then its geographic hierarchy key changes as well.
I have found this design pattern:
Another useful design pattern is to add the durable account key to the fact table in addition to the dimension’s surrogate key. This joins back to the current rows in the dimension to make it easier to report all of history by the current dimension attributes.
Could you please tell me whether it is OK to follow this solution? If yes, do I need to use a key of type UNIQUEIDENTIFIER to get a unique value?
Another question on this: I have employee data (DimEmployee and FactEmployee). Each employee is associated with the places where he or she works. How do I connect the employee tables with the places tables? Do I need to connect FactEmployee with FactPlaces?
I think in the first instance, they're referring to business keys? So if your dimension table has two rows, surrogate key 1 & 2, but they both refer to the same thing, so both have AccountId/ProductId/WhateverId of 1, then you will have some fact table rows with surrogate key 1 and business key 1, and later ones with surrogate key 2 and business key 1.
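To show what carrying both keys buys you, here is a sketch in Python with sqlite3, runnable as-is. PlaceKey as the surrogate key, PlaceId as the durable business key, the IsCurrent flag and the Visits measure are all assumed names, not something from the question. Joining on the surrogate key reports history as it was recorded; joining on the durable key to the current row reports all of history under the current name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimPlaces (
    PlaceKey  INTEGER PRIMARY KEY,  -- surrogate key, one per version
    PlaceId   INTEGER NOT NULL,     -- durable (business) key, stable over time
    PlaceName TEXT NOT NULL,
    IsCurrent INTEGER NOT NULL
);
CREATE TABLE FactPlaces (
    PlaceKey INTEGER NOT NULL REFERENCES DimPlaces(PlaceKey),
    PlaceId  INTEGER NOT NULL,      -- durable key repeated on the fact row
    Visits   INTEGER NOT NULL       -- hypothetical measure
);
-- The same place before and after a rename.
INSERT INTO DimPlaces VALUES (1, 500, 'Old Mill', 0), (2, 500, 'Mill Plaza', 1);
INSERT INTO FactPlaces VALUES (1, 500, 10), (2, 500, 4);
""")

# History as it was: each fact row keeps the name valid when it was loaded.
as_was = conn.execute("""
    SELECT d.PlaceName, SUM(f.Visits)
    FROM FactPlaces AS f
    JOIN DimPlaces AS d ON d.PlaceKey = f.PlaceKey
    GROUP BY d.PlaceName
    ORDER BY d.PlaceName
""").fetchall()

# History by current attributes: join on the durable key to the current row.
as_is = conn.execute("""
    SELECT d.PlaceName, SUM(f.Visits)
    FROM FactPlaces AS f
    JOIN DimPlaces AS d ON d.PlaceId = f.PlaceId AND d.IsCurrent = 1
    GROUP BY d.PlaceName
""").fetchall()
```

A plain integer is enough for the durable key in this sketch; nothing in the pattern itself requires a UNIQUEIDENTIFIER.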
Uniqueidentifiers are very wide (16 bytes); try to avoid using them on fact tables and for joins if possible.
For your last question: that's really more of a reporting thing. Do you need to do that? Is that what people need to see? Do they need to slice by it? You could consider a referenced dimension, where the places table links to the fact tables via a placeId on the employees dimension. Or you could have a factemployees table with start and stop dates. It depends on what you need to achieve.
I am currently working on an issue tracker for my company to help them keep track of problems that arise with the network. I am using C# and SQL.
Each issue has about twenty things we need to keep track of (status, work loss, who created it, who's working on it, etc.). I need to attach a list of the teams affected by the issue to each entry in my main issue table. Ideally, that list contains some sort of link to a unique table instance, just for that issue, that shows the teams affected and what percentage of each team's labs are affected.
So my question is: what is the best way to implement this "link" between an entry in the issue table and a unique table for that issue? Or am I thinking about this problem wrong?
What you are describing is called a "many-to-many" relationship. A team can be affected by many issues, and likewise an issue can affect many teams.
In SQL database design, this sort of relationship requires a third table, one that contains a reference to each of the other two tables. For example:
CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY
    -- other attributes
);
CREATE TABLE issues (
    issue_id INTEGER PRIMARY KEY
    -- other attributes
);
CREATE TABLE team_issue (
    issue_id INTEGER NOT NULL,
    team_id INTEGER NOT NULL,
    FOREIGN KEY (issue_id) REFERENCES issues(issue_id),
    FOREIGN KEY (team_id) REFERENCES teams(team_id),
    PRIMARY KEY (issue_id, team_id)
);
This sounds like a classic many-to-many relationship...
You probably want three tables,
One for issues, with one record (row) per each individual unique issue created...
One for the teams, with one record for each team in your company...
And one table called, say, "IssueTeams" or "TeamIssueAssociations" or "IssueAffectedTeams" to hold the association between the two...
This last table will have one record (row) for each team an issue affects... This table will have a 2-column composite primary key, on the columns IssueId and TeamId... Every row will have to have a unique combination of these two values... Each of which is individually a Foreign Key (FK) to the Issue table and the Team table, respectively.
For each team, there may be zero to many records in this table, for each issue the team is affected by,
and for each Issue, there may be zero to many records each of which represents a team the issue affects.
If I understand the question correctly I would create....
ISSUE table containing the 20 or so items
TEAM table containing a list of teams.
TEAM_ISSUES table containing the link between the two
The TEAM_ISSUES table needs to contain a foreign key to each of the ISSUE and TEAM tables (i.e. it should contain an ISSUE_ID and a TEAM_ID); it therefore acts as an intersection between the two "master" tables. It sounds like this is also the place to put the percentage.
Does that make sense?
There are so many good free open source issue trackers available that you should have pretty good reasons for implementing your own. You could use your time much better in customizing an existing tracker.
We are using Bugtracker.NET in the team I work for. It's been customized quite a bit, but there was no point in developing a system from the beginning. The reason we chose that product was that it runs on .NET and works great with SQL Server, but there are many other alternatives.
We can see those entities in your domain:
The "Issue"
"Teams" affected by that issue, in a certain percentage
So, having identified those two items, you can represent that with two tables, and the relationship between them is another table, that could track the percentage impact too.
Hope this helps.
I wouldn't create a unique table for each issue. I would do something like this
Table: Issue
    IssueId primary key
    status
    workLoss
    createdby
    etc
Table: Team
    TeamID primary key
    TeamName
    etc
Table: IssueTeam
    IssueID (foreign key to issue table)
    TeamID (foreign key to team table)
    PercentLabsAffected
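As a runnable sketch of that layout (Python with sqlite3 here rather than C# and SQL Server, purely so it can be tried without a server): the table and column names follow the outline above, while the composite primary key, the CHECK constraint and the sample rows are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Issue (IssueID INTEGER PRIMARY KEY, status   TEXT);
CREATE TABLE Team  (TeamID  INTEGER PRIMARY KEY, TeamName TEXT);
CREATE TABLE IssueTeam (
    IssueID INTEGER NOT NULL REFERENCES Issue(IssueID),
    TeamID  INTEGER NOT NULL REFERENCES Team(TeamID),
    PercentLabsAffected REAL NOT NULL
        CHECK (PercentLabsAffected BETWEEN 0 AND 100),
    PRIMARY KEY (IssueID, TeamID)         -- one row per affected team per issue
);
INSERT INTO Issue VALUES (1, 'open');
INSERT INTO Team  VALUES (1, 'Alpha'), (2, 'Beta'), (3, 'Gamma');
INSERT INTO IssueTeam VALUES (1, 1, 50.0), (1, 3, 12.5);
""")

# The "list of teams affected" for an issue is a query, not a separate table.
affected = conn.execute("""
    SELECT t.TeamName, it.PercentLabsAffected
    FROM IssueTeam AS it
    JOIN Team AS t ON t.TeamID = it.TeamID
    WHERE it.IssueID = ?
    ORDER BY t.TeamName
""", (1,)).fetchall()

# A link to a team that does not exist is refused by the foreign key.
try:
    conn.execute("INSERT INTO IssueTeam VALUES (1, 99, 10.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```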
Unless I'm understanding wrong what you're trying to do, you should not have a unique table for each instance of an issue.
Your database should have three tables: an Issues table, a Teams table, and an IssueTeams joining table. The IssueTeams table would include foreign keys (i.e. TeamID and IssueID) that reference the respective team in Teams and issue in Issues. So IssueTeams might have records like (Issue1, Team1), (Issue1, Team3). You could keep the affected percentage of each team's labs in the joining table.
Well, just to be all modern and agile-y, 'getting it right the first time' is less trendy than 'refactorable.' But to work through your model:
You have Issues (heh heh). You have Teams.
An Issue affects many Teams. A Team is affected by many Issues. So just for the basic problem, you seem to have a classic Many:Many relationship. A join table containing two columns, one to Issue PK and one to Team PK takes care of that.
Then you have the question of what % of teams. There's a dynamic aspect to that, of course, so to do it right, you'll need to specify a trigger. But the obvious place to put it is a column in Issue ("Affected_Team_Percentage").
If I understand you correctly, you want to create a new table of teams affected for each issue. Creating tables as part of normal operations rings my relational database design alarm bell. Don't do it!
Instead, use one affected_teams table with a foreign key to the issues table and a foreign key to the teams table. That will do the trick.