Star Schema - Fact table without uniqueness - ssas

We have a data warehouse that contains a large fact table with over 100 million rows. I'm trying to create a cube that includes this fact table and need to create a fact dimension based on it. The issue I'm running into is that there is no way to establish uniqueness on this table using the fields that would be included in the fact dimension, short of using every field in the table.
I created a surrogate key in the DSV using:
Row_Number() OVER (ORDER BY ID, Dt, Num)
I've used this method to create a surrogate key in another DSV and it worked, but in that case I was also able to establish uniqueness with the fields in the ORDER BY.
When I browse the cube based on this fact table I get the correct results when using regular dimensions. When I try to use fields from the fact dimension I get erroneous results in most cases; a few are correct, but very few.
Would this be a case where I should request that a surrogate key get created on the fact table? Is there a better solution that someone could suggest?
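For context, the named query in the DSV has roughly this shape (the table name and the Amount measure column are assumptions, not the real schema); the comments note why the numbering can shift between processing runs:
-- Rough shape of the DSV named query; names are placeholders.
-- Because (ID, Dt, Num) is not unique, ROW_NUMBER can assign the numbers
-- differently each time the query runs, so the fact dimension and the
-- measure group may be processed with different keys for the same row.
SELECT
    ROW_NUMBER() OVER (ORDER BY ID, Dt, Num) AS FactKey,
    ID,
    Dt,
    Num,
    Amount   -- placeholder measure column
FROM dbo.LargeFactTable;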

Related

OLAP Data Warehouse - composite primary key as multiple or single fields

I'm building a data warehouse, and the data is of a quality where 8 fields may be required to uniquely identify a record, and this applies to three tables, each of which will have a few million rows of data per year. It's all 0NF.
Obviously every situation is unique, but considering that the purpose of the data warehouse is for OLAP, am I right in thinking that I would be better to create a single column to use as the primary key rather than a composite primary key of 8 separate fields? It's straightforward to concatenate the fields into an extra column as part of the ETL pipeline.
I appreciate the redundancy increases the storage requirement, and we are talking millions of rows a year, but I'm guessing it'll significantly improve query performance? And reduce memory requirements if the data is modelled in a BI tool?
Can anybody give me any general thoughts or advice on this please?
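For illustration, the concatenation idea might look something like this as an ETL step (column names are placeholders, and CONCAT_WS assumes SQL Server 2017 or later); hashing is just one option for keeping the derived key short:
-- Hypothetical ETL step: derive one surrogate key from the 8 natural-key fields.
-- The '|' separator stops ('AB','C') and ('A','BC') from colliding.
SELECT
    HASHBYTES('SHA2_256',
        CONCAT_WS('|', Field1, Field2, Field3, Field4,
                       Field5, Field6, Field7, Field8)) AS OrderKey,
    o.*
FROM staging.Orders AS o;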
Below is some entirely made-up simulated data. I need to link the order table to the shipment table to get where the order was shipped from, for example, or to sum the quantity shipped.
I don't think normalising the tables is the way to go, as all four of the columns I'm using here would be subject to change, and only combined do they form a reliable key for a unique shipment.
Because of this, the data is bulk deleted/inserted based on shift date.
Thanks!
Phil.
Those look like fact tables. In a dimensional model only dimension tables need single-column keys. Fact tables typically have compound keys made up of the dimension foreign keys that define the fact table grain.
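A minimal sketch of what that looks like, with made-up table and column names:
-- Hypothetical fact table: the grain is (date, product, warehouse, shipment number),
-- so the primary key is the combination of the dimension foreign keys plus a
-- degenerate dimension, not a single surrogate column.
CREATE TABLE dbo.FactShipment (
    DateKey         int          NOT NULL REFERENCES dbo.DimDate (DateKey),
    ProductKey      int          NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    WarehouseKey    int          NOT NULL REFERENCES dbo.DimWarehouse (WarehouseKey),
    ShipmentNumber  varchar(20)  NOT NULL,  -- degenerate dimension, part of the grain
    QuantityShipped int          NOT NULL,
    CONSTRAINT PK_FactShipment
        PRIMARY KEY (DateKey, ProductKey, WarehouseKey, ShipmentNumber)
);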

Is it relevant to use the fact table primary key in dimension table?

I am designing a database and I am wondering something about primary keys and foreign keys. I have a kind of snowflake database diagram with a fact table and some dimension tables (if I can call them that). Because of what I am doing, I need to generate a record in my fact table before adding rows to the dimension tables, and those rows (and tables) use the primary key of my fact table.
I have been reading topics about it, and I see that I should use an ID in the dimension tables that is then referenced in the fact table (the opposite of what I am doing).
Let me show you part of my diagram for a better understanding:
First of all, sorry: the table attributes are written in French (I am French; sorry for my bad English, by the way).
The "MASQUENumMasque" column in the "dimension" tables references "NumMasque" in the "MASQUE" table, and I use this foreign key as the primary key of the tables that use it.
So, my question is very simple: am I doing this right?
If you need more information or if something is unclear, tell me!
Thank you guys!
You are doing it wrong; data should always be added from the edge of the snowflake model toward the center.
You always have to add the row to the dimension first, and then the data in the fact table, pointing to the dimension row you just added. Otherwise you will have constraint issues.
Example: you have a fact table ORDERS (order_id, shop_id) and a dimension table SHOP (shop_id, shop_name). When loading a new set of data, you load the dimension table first, because it will then be referenced by the fact table through the shop_id key. If you load the fact table first, orders.shop_id will point to nowhere.
So in your case, for the RONCHI table for example, you should have a Ronchi_id column. In your MASQUE table, you should have a Ronchi_id column pointing to RONCHI's primary key.
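To make the load order concrete, a minimal sketch using the ORDERS/SHOP example above (the values are made up):
-- 1. Load the dimension row first so its key exists.
INSERT INTO SHOP (shop_id, shop_name)
VALUES (42, 'Main Street Store');
-- 2. Then load the fact row that references it; the foreign key
--    on ORDERS.shop_id is now satisfied.
INSERT INTO ORDERS (order_id, shop_id)
VALUES (1001, 42);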

Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub-row" key to the dataset like this:
ROW_NUMBER() OVER (PARTITION BY "all data columns" ORDER BY SystemTime) AS data_row_nr
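To make the comparison concrete, my understanding of the suggestion is something like this (column names are placeholders, and it assumes the key columns are never NULL):
-- Count identical rows on each side and compare: a positive diff means rows
-- to insert, a negative diff means open records to close in the DWH.
SELECT
    COALESCE(s.col1, d.col1) AS col1,
    COALESCE(s.col2, d.col2) AS col2,
    COALESCE(s.cnt, 0) - COALESCE(d.cnt, 0) AS diff
FROM (SELECT col1, col2, COUNT(*) AS cnt FROM staging.ledger GROUP BY col1, col2) AS s
FULL OUTER JOIN
     (SELECT col1, col2, COUNT(*) AS cnt FROM dwh.ledger WHERE end_date IS NULL GROUP BY col1, col2) AS d
    ON s.col1 = d.col1 AND s.col2 = d.col2;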
I've tried to find out whether this is good practice or not, but without any luck. Something about it just seems wrong to me, but I can't see what unforeseen consequences might arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You will solve the technical problem, but I doubt there will be any business value.
First of all you should distinguish: the tables you get with a primary key are dimensions, and for those you can recognise changes and build history.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not fully loaded but loaded based on some DELTA criterion.
Anyway, you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my to-do list:
Check whether the duplicates are intended or illegal
Try to find a delta criterion to load the fact tables (a sketch of that idea follows this list)
If everything fails, build the primary key from all columns plus a single attribute holding the number of the duplicate, and build the history.
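For the second point, a rough sketch of a delta load (treating SystemTime as the delta criterion is an assumption that would have to be verified against the source, and the very first load needs special handling because MAX() is NULL on an empty table):
-- Hypothetical delta load: only take staging rows newer than what the
-- warehouse has already seen.
INSERT INTO dwh.ledger_fact
SELECT s.*
FROM staging.ledger AS s
WHERE s.SystemTime > (SELECT MAX(f.SystemTime) FROM dwh.ledger_fact AS f);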

What's the best way to create a unique repeatable identification column with rows that are nearly identical?

I'm writing a stored procedure that links together data from several different relational tables based on the primary key for the main table. This information is being sent to a flat database. The stored procedure is going to produce several nearly identical rows where only a single column may be different, due to multiple entries in some of the tables that are linked to a single entry in the main table. I need to uniquely identify each row in the stored procedure output, but I am unable to use the primary key from the main table since there will be multiple entries for each "key".
I considered using the primary key from the main table followed by each of the columns that may differ between duplicate rows.
However, this approach results in a very long and messy key. I am unable to use a GUID because, if any data changes in the relational database, the stored procedure is rerun and must update old entries rather than create new ones.
If your purpose is only to have a unique key that is as short as possible and does not relate to anything else, consider just adding ROW_NUMBER() to your select.
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowKey, othercolumns
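In context that might look like the sketch below; since the question says reruns must update old entries, ordering by real columns (names invented here) keeps the numbering repeatable, whereas ORDER BY (SELECT NULL) makes no such guarantee:
-- Hypothetical shape of the stored-procedure output with a generated row key.
SELECT
    ROW_NUMBER() OVER (ORDER BY m.MainId, d.DetailValue) AS RowKey,  -- deterministic ordering
    m.MainId,
    m.MainColumn,
    d.DetailValue
FROM dbo.MainTable AS m
JOIN dbo.DetailTable AS d
    ON d.MainId = m.MainId;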

Creating an OLAP Cube from a flat table in SSAS/SSRS

I'm new to this topic. I've got a database with a flat fact table, which contains data like date, product group, product subgroup, the product's actual name, and some calculations/statistics. All I need to do is create a report using an OLAP cube. I have two ideas for how to create it, but I don't know which draft is better (or even correct). The original DAILY_REPORT... table has no primary key; it's just a data table. In the first concept I created every table (which will serve as a dimension) with an ID, and connected product -> product family -> project -> building in a hierarchy. The other concept is without the IDs and the hierarchy, with relations created automatically based on names. Can somebody explain which direction I should take?
First idea:
http://imgur.com/iKNfAXF
Second:
http://imgur.com/IZjW1W6
Thanks in advance!
You can follow these steps to create your cube:
Create a separate view for each of the dimensions you want to have. Group similar types of data in one view, e.g. Product Name, Product Group, Product Sub-Group, etc.
Keep the data in your dimension view DISTINCT, e.g. SELECT DISTINCT [Product Name], [Product Group], [Product Sub-Group] FROM TABLE
Keep an 'ID' column in each dimension view, e.g. a Product ID in the Product view
Create a view for your fact. Include the 'ID' column of each dimension in your fact view. This lets you create relationships on the 'ID' columns, which will be a lot faster than relationships created on names (see the sketch below).
For creating hierarchies in dimension attributes, SSAS provides drag-and-drop functionality.
If you need more details let me know.
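If it helps, a rough illustration of the steps above (the DAILY_REPORT table name is abbreviated as in the question, and the column names are guesses):
-- Steps 1-3: dimension view with distinct rows and a generated ID.
CREATE VIEW dbo.vw_DimProduct AS
SELECT
    ROW_NUMBER() OVER (ORDER BY [Product Name]) AS ProductID,
    [Product Name],
    [Product Group],
    [Product Sub-Group]
FROM (SELECT DISTINCT [Product Name], [Product Group], [Product Sub-Group]
      FROM dbo.DAILY_REPORT) AS p;
GO
-- Step 4: fact view that looks up the dimension ID by the natural attributes.
CREATE VIEW dbo.vw_FactDailyReport AS
SELECT
    d.ProductID,
    r.[Date],
    r.SomeMeasure
FROM dbo.DAILY_REPORT AS r
JOIN dbo.vw_DimProduct AS d
    ON  d.[Product Name]      = r.[Product Name]
    AND d.[Product Group]     = r.[Product Group]
    AND d.[Product Sub-Group] = r.[Product Sub-Group];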
You could construct the dimensions you need from views based on distinct queries (i.e. SELECT DISTINCT) against the source data. These can be used to populate the dimensions.
You can make a synthetic date dimension fairly easily.
Then you can create a DSV that joins the views back against the fact table to populate the measure group.
If you need to fake a primary key then you can use a view that annotates the fact table with a column generated from row_number() or some similar means. Note that this is not necessarily stable across runs, so you can't rely on it for incremental loads. However, it would work fine for complete refreshes.
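A minimal sketch of such an annotating view (the table name is assumed):
-- Fake primary key for the DSV: fine for a full reprocess, but the numbering
-- can change between runs, so do not rely on it for incremental loads.
CREATE VIEW dbo.vw_FactWithKey AS
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS FactRowKey,
    f.*
FROM dbo.FlatFactTable AS f;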