In data modelling, is it acceptable for a dimension to hold a surrogate key to another dimension as an attribute, or should this always be a business key?
I have an Item dimension, which has a Department Number as an attribute. I also have a Department dimension. Is it acceptable for the Item Dimension to hold the SK to the Department Dimension or just the business key?
Usually you would avoid having both the natural and the surrogate key as foreign keys in the table, because that is redundant and can lead to data inconsistency. Example: someone updates the natural key and forgets to also update the surrogate key.
In a data warehouse, which you tagged your question with, redundancy is much less of a problem. There is usually a transaction processing system with loads of inserts, updates and deletes, and then there is the data warehouse. The data warehouse gets all its data beautifully arranged from the processing system, and there simply are no updates like the one mentioned above. If data is redundant, who cares? It simplifies data access. You can even store the employee-department join as a table with all the department data redundant. A data warehouse is all about easy and quick access to data, so as to make reporting easier. Redundant foreign keys are no problem in a data warehouse.
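As a minimal sketch of the point above, the Item dimension can carry both the Department surrogate key and the Department business key, with the load process keeping the two in step. The table names, column names and sample rows here are hypothetical, and SQLite (via Python's standard library) is used only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_department (
    department_sk     INTEGER PRIMARY KEY,  -- surrogate key
    department_number TEXT NOT NULL,        -- business key
    department_name   TEXT NOT NULL
);
CREATE TABLE dim_item (
    item_sk           INTEGER PRIMARY KEY,  -- surrogate key
    item_number       TEXT NOT NULL,        -- business key
    department_sk     INTEGER NOT NULL REFERENCES dim_department (department_sk),
    department_number TEXT NOT NULL         -- redundant copy, written by the load
);
-- Hypothetical sample rows
INSERT INTO dim_department VALUES (1, 'D10', 'Grocery'), (2, 'D20', 'Hardware');
INSERT INTO dim_item VALUES (100, 'I-001', 1, 'D10'), (101, 'I-002', 2, 'D20');
""")

# Sanity check a load could run: items whose two department keys disagree.
mismatches = conn.execute("""
    SELECT i.item_sk
    FROM dim_item AS i
    JOIN dim_department AS d ON d.department_sk = i.department_sk
    WHERE d.department_number <> i.department_number
""").fetchall()
print(mismatches)
```

Because the warehouse is only written by the load, a check like this is usually enough to catch the inconsistency that redundancy would otherwise risk.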
I'm building a data warehouse, and the data is of a quality where 8 fields may be required to uniquely identify a record, and this applies to three tables, each of which will have a few million rows of data per year. It's all 0NF.
Obviously every situation is unique, but considering that the purpose of the data warehouse is for OLAP, am I right in thinking that I would be better to create a single column to use as the primary key rather than a composite primary key of 8 separate fields? It's straightforward to concatenate the fields into an extra column as part of the ETL pipeline.
I appreciate the redundancy increases the storage requirement, and we are talking millions of rows a year, but I'm guessing it'll significantly improve query performance? And reduce memory requirements if the data is modelled in a BI tool?
Can anybody give me any general thoughts or advice on this please?
Below is some entirely made-up simulated data. I need to link the order table to the shipment table to get where the order was shipped from, for example, or maybe the order table to the shipment table to sum the quantity shipped.
I don't think normalising the tables is the way to go, as all four of the columns I'm using here would be subject to change, and only combined they form a reliable key for a unique shipment.
Because of this the data is bulk deleted/inserted based on shift date.
Thanks!
Phil.
Those look like fact tables. In a dimensional model only dimension tables need single-column keys. Fact tables typically have compound keys made up of the dimension foreign keys that define the fact table grain.
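A rough sketch of what that means in practice: the fact table's primary key is simply the combination of the dimension keys that define its grain, so no concatenated single-column key is needed. The table, the three dimension keys and the values below are invented for illustration and are not taken from the question's data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_shipment (
    date_sk          INTEGER NOT NULL,  -- FK to a date dimension
    product_sk       INTEGER NOT NULL,  -- FK to a product dimension
    site_sk          INTEGER NOT NULL,  -- FK to a shipping-site dimension
    quantity_shipped INTEGER NOT NULL,
    PRIMARY KEY (date_sk, product_sk, site_sk)  -- compound key = fact grain
);
INSERT INTO fact_shipment VALUES (20240101, 1, 7, 40);
""")

# A second row at the same grain violates the compound key.
try:
    conn.execute("INSERT INTO fact_shipment VALUES (20240101, 1, 7, 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```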
I need to create a bus matrix, and in order to do that I need to know which fact table has relationships with which dimension tables.
Unfortunately, in this new project I'm in, there seem to be no FKs (crazy, I know).
What I thought about is to use ETL queries and check the joins between the Fact table with the dimension tables.
What I'm worried about is that there might be more relationships that are not included in ETL queries...any advice?
You can use the system metadata tables to list the foreign key references:
select tbname, pkcolnames, reftbname, fkcolnames, colcount
from SYSIBM.SYSRELS B;
If the database does not have properly declared foreign key relationships, then the database does not have the information you are looking for.
Assuming the DB holds no information about the FKs (or information that would help you derive them, like identical column names) then, as you mentioned, examining the ETL code used to load each fact table is probably the only other way of doing this. The ETL must be running a look up on each dimension to get the PK to insert into the fact record, so the information will be there.
There shouldn't be any relationships involving facts that you couldn't determine with this approach. There may be additional relationships between dimensions (bridge tables, more complex SCD types, etc.) but if you sorted out the fact relationships then what remains should be a small enough subset to resolve manually (i.e. by intelligent guesswork)
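If the ETL is written as SQL, a first draft of the bus matrix can be pulled out mechanically. The sketch below assumes dimension tables follow a `Dim*` naming convention and that each fact table's load query is available as text; the queries, schema names and table names are all made up. A regular expression like this will miss lookups done any other way (subqueries, lookup components in an ETL tool, views), so treat the result as a starting point to review by hand.

```python
import re
from collections import defaultdict

# Hypothetical ETL load queries, keyed by the fact table they populate.
etl_queries = {
    "FactSales": """
        SELECT d.DateKey, p.ProductKey, s.Amount
        FROM stg.Sales s
        JOIN dbo.DimDate d ON d.FullDate = s.SaleDate
        LEFT JOIN dbo.DimProduct p ON p.ProductCode = s.ProductCode
    """,
    "FactInventory": """
        SELECT d.DateKey, p.ProductKey, w.WarehouseKey, i.OnHand
        FROM stg.Inventory i
        JOIN dbo.DimDate d ON d.FullDate = i.SnapshotDate
        JOIN dbo.DimProduct p ON p.ProductCode = i.ProductCode
        JOIN dbo.DimWarehouse w ON w.WarehouseCode = i.WarehouseCode
    """,
}

# Capture the table name after JOIN, skipping an optional schema prefix.
JOIN_TABLE = re.compile(r"\bJOIN\s+(?:\w+\.)?(Dim\w+)", re.IGNORECASE)

bus_matrix = defaultdict(set)
for fact, sql in etl_queries.items():
    for dim in JOIN_TABLE.findall(sql):
        bus_matrix[fact].add(dim)

for fact in sorted(bus_matrix):
    print(fact, sorted(bus_matrix[fact]))
```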
I have a fact table with five dimension tables associated with it. Typically, the fact table contains the surrogate keys of each dimension and has no business or surrogate key of its own. I am trying to load the fact table with data from the staging fact table, i.e. insert new records. However, I notice the fact table may also need to handle other operations on the data, such as Update or Delete. A conditional split was used in the SSIS package for this purpose: it checks whether all surrogate keys are 0 and, if so, makes the new insert. My question is: can I use the surrogate keys to drive an Update or Delete?
I made an insert on the fact table just to give an idea of how the data will look like.
The answer is yes, you can. BUT, will there be a situation where one employee sold the same product, from the same supplier, to the same customer, on the same day? Perhaps a different order on the same day? (this is based on the data you present in the question)
If all the surrogate keys together can uniquely identify a record, update fact records to your heart's content. But if that is not the case, you could end up updating records that you did not intend to update.
I tend to include an order number in the fact tables I design to help avoid that situation, but you may not have that in your actual fact tables. Including the order number in the fact table is a pattern referred to as a degenerate dimension. I have found it to be pretty handy.
Anyway, the answer is the same. You can update fact records based on surrogate keys, as long as all of them together can uniquely identify the row(s) you want to update.
Don't throw caution to the wind; be sure your data warehouse is designed such that you can do this if you need to. Being able to do in-place updates of facts can be nice, versus delete and replace, in that there could be fewer steps in the ETL process.
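To make the risk concrete, here is a small sketch with two orders that share all five surrogate keys. The table, the column names and the values are invented for illustration; the order number plays the role of the degenerate dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (
    employee_sk  INTEGER NOT NULL,
    product_sk   INTEGER NOT NULL,
    supplier_sk  INTEGER NOT NULL,
    customer_sk  INTEGER NOT NULL,
    date_sk      INTEGER NOT NULL,
    order_number TEXT    NOT NULL,  -- degenerate dimension
    quantity     INTEGER NOT NULL
);
-- Same employee, product, supplier, customer and day, but two orders
INSERT INTO fact_sales VALUES (1, 2, 3, 4, 20240101, 'SO-1', 5);
INSERT INTO fact_sales VALUES (1, 2, 3, 4, 20240101, 'SO-2', 9);
""")

sk_filter = ("employee_sk = 1 AND product_sk = 2 AND supplier_sk = 3 "
             "AND customer_sk = 4 AND date_sk = 20240101")

# Surrogate keys alone match both rows, so an UPDATE on them would hit both.
rows_matched_by_sks = conn.execute(
    f"SELECT COUNT(*) FROM fact_sales WHERE {sk_filter}"
).fetchone()[0]

# Adding the order number narrows the update to the intended row.
rows_updated = conn.execute(
    f"UPDATE fact_sales SET quantity = 6 WHERE {sk_filter} AND order_number = 'SO-1'"
).rowcount

print(rows_matched_by_sks, rows_updated)
```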
Suppose I have a set of 'model' entities, and a set of difficulty levels. Each model has a certain percentage success rate for a given day on a given difficulty level.
A model entity has a name, which is both unique and immutable under any circumstances, so it makes a natural primary key. A difficulty level is described by its name only (easy, normal, etc). Difficulty levels are very, very unlikely to change, though it's possible a new one could be added. A success rate record is uniquely identified by the model it pertains to, the difficulty level, and the date.
Here's the most trivial db design for this scenario:
In this design, 'name' is the primary key for the table 'models' and is represented by a VARCHAR(20) field. Likewise, a VARCHAR(20) field 'name' is the primary key for the table 'difficulty_levels' (a lookup table). In the 'success_rates' table, 'model_name' is a foreign key referencing the 'name' field in the 'models' table, and 'difficulty_level' is a foreign key referencing the 'name' field in the 'difficulty_levels' table. The fields 'model_name', 'difficulty_level' and 'date' make up a composite primary key for the 'success_rates' table.
The most used queries would be:
getting all success rates for a certain model, difficulty level, and date period
getting the most/least successful models for a certain period and difficulty level.
Now, my question is - do I need to add surrogate primary keys to the 'models' and 'difficulty_levels' tables? I guess storing int values as opposed to varchar values in the foreign key fields of 'success_rates' takes up less space, and maybe the queries will be faster (just my wild guess, not sure about that)?
The problem I see with surrogate keys is that they have zero relevance to the business logic itself. I'm planning on using a mini-ORM (most likely Dapper), and without the surrogate keys I can operate on classes that very cleanly represent the entities I'm working with. Now, if I add surrogate keys, I'll have to add 'Id' properties to my classes, and I'm really against adding a database storage implementation like that to a class that can be used anywhere in the app, not even in connection with a database storage. I could add proxy storage classes with an Id property, but that adds another level of complexity. Plus the fact that the 'Id' property won't be readonly (so the ORM can set the ids after saving the entity to the database) means that it would be possible to accidentally set it to a random/invalid value.
I'm not very familiar with ORMs and I have zero knowledge of Dapper, so correct me if I am wrong on any of these points.
What would be the best approach here?
"The problem I see with surrogate keys is that they have zero relevance to the business logic itself."
That is actually the benefit, not the problem. The key's value is the immutable identifier of the entity, regardless of whatever else changes. I really doubt that there is any widely used ORM that cannot easily make use of it, as the alternative (cascading changes to a key value to every child record) is so much more difficult to implement.
I'd also suggest that you add a value to your difficulty levels that allows the hierarchy of increasing difficulty to be represented in the data model, otherwise "harder than" or "easier than" cannot be robustly represented.
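One way this could look, as a sketch only: the lookup table gets an integer surrogate key plus an explicit ordering column. The `id` and `difficulty_rank` column names and the 'hard' level are assumptions added for illustration; the question only names 'easy' and 'normal'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE difficulty_levels (
    id              INTEGER PRIMARY KEY,          -- surrogate key
    name            VARCHAR(20) NOT NULL UNIQUE,  -- natural key stays unique
    difficulty_rank INTEGER NOT NULL UNIQUE       -- higher means harder
);
-- Gaps in the rank leave room to slot in a new level later
INSERT INTO difficulty_levels (name, difficulty_rank)
VALUES ('easy', 10), ('normal', 20), ('hard', 30);
""")

# "Harder than normal" becomes a plain comparison on the rank column.
harder_than_normal = [row[0] for row in conn.execute("""
    SELECT h.name
    FROM difficulty_levels AS h
    JOIN difficulty_levels AS n ON n.name = 'normal'
    WHERE h.difficulty_rank > n.difficulty_rank
    ORDER BY h.difficulty_rank
""")]
print(harder_than_normal)
```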
I have designed places-related warehouse tables: DimPlaces, FactPlaces and DimGeography. It is a straightforward design. All the locations are in DimPlaces (Addrline1, Addrline2, placename, etc.) and the geography hierarchy is in DimGeography (City, State, Country, PostCode). FactPlaces is the table that has foreign keys to DimPlaces and DimGeography.
I would like to maintain historical data, as place names or their properties might change, and if the location of a place changes then its geographic hierarchy key changes as well.
I have found design pattern -
Another useful design pattern is to add the durable account key to the fact table in addition to the dimension’s surrogate key. This joins back to the current rows in the dimension to make it easier to report all of history by the current dimension attributes.
Could you please tell me whether it is OK to follow this solution? If yes, do I need to use a key of type UNIQUEIDENTIFIER for a unique value?
Another question on this: I have employee data (DimEmployee and FactEmployee). Each employee is associated with the places where he works. How do I connect these employee tables with the places tables? Do I need to connect FactEmployee with FactPlaces?
I think in the first instance, they're referring to business keys? So if your dimension table has two rows, surrogate key 1 & 2, but they both refer to the same thing, so both have AccountId/ProductId/WhateverId of 1, then you will have some fact table rows with surrogate key 1 and business key 1, and later ones with surrogate key 2 and business key 1.
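A small sketch of that arrangement, under the assumption that the dimension keeps history as one row per version with a current-row flag. The `place_id`, `is_current` and `visits` columns and the sample rows are hypothetical. The fact carries both keys: the surrogate key gives the attributes as they were at the time, and the durable key joined to the current row gives all of history under today's attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_places (
    place_sk   INTEGER PRIMARY KEY,  -- surrogate key, one per version
    place_id   INTEGER NOT NULL,     -- durable/business key, same across versions
    place_name TEXT    NOT NULL,
    is_current INTEGER NOT NULL      -- 1 for the latest version
);
CREATE TABLE fact_places (
    place_sk INTEGER NOT NULL,       -- version in effect when the fact occurred
    place_id INTEGER NOT NULL,       -- durable key
    visits   INTEGER NOT NULL
);
-- The same place, renamed once
INSERT INTO dim_places VALUES (1, 1, 'Old Mill Cafe', 0), (2, 1, 'Riverside Cafe', 1);
INSERT INTO fact_places VALUES (1, 1, 10), (2, 1, 15);
""")

# History reported by the attributes in effect at the time
as_was = conn.execute("""
    SELECT d.place_name, SUM(f.visits)
    FROM fact_places AS f
    JOIN dim_places AS d ON d.place_sk = f.place_sk
    GROUP BY d.place_name
    ORDER BY d.place_name
""").fetchall()

# All of history reported by the current attributes
as_is = conn.execute("""
    SELECT d.place_name, SUM(f.visits)
    FROM fact_places AS f
    JOIN dim_places AS d ON d.place_id = f.place_id AND d.is_current = 1
    GROUP BY d.place_name
""").fetchall()

print(as_was)
print(as_is)
```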
Uniqueidentifiers are very wide (16 bytes each, versus 4 for an INT), so try to avoid using them on fact tables and for joins if possible.
For your last question: that's really more of a reporting thing. Do you need to do that? Is that what people need to see, and do they need to slice by it? You could consider a referenced dimension, where the places table links to the fact tables via a placeId on the employees dimension. Or you could have a factemployees table with start and stop dates. It depends on what you need to achieve.