Loading fact table with SCD type 2 dimension - pentaho

I have got a dimension tables with 1 million records which is SCD type 2.I am using pentaho Dimension lookup step for populating this dimension table. I am getting a version number,start date and end date. Now I want to populate the fact table based on the scd type2. What is the best approach for this?

Use the 'Dimension lookup/update' step for looking up the surrogate id ('Technical key field'), based on the natural key(s) ('Keys') and timestamp ('Stream Datefield'). Uncheck 'Update the dimension' if you only do lookup.

Related

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years if will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is analytics / ad-hoc querying / OLAP in general only, then you can use star-schema which is well suited for these type of analytics. But beware, OLAP databases are de-normalized and not suitable for operational transaction storage / OLTP in general, if you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, making the tables very small even though there are too many records. Small table means it is very fast to read (I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have surrogate key, which is a single column primary key. Single column primary key is easier to JOIN than a multi-column primary key and also easier to index.
There is no NULL in foreign keys in fact tables. This makes JOIN operations straightforward, i.e. always JOIN fact table to all of its dimension tables. If you need NULL case, you need to add that as a special case in your dimension table. For example: if a company is not listed on stock market, and one of the thing you track is stock price, then you enter 0 or NULL into the fact for the stock price table depending on (how you want to do SUM(), AVG() etc later) and then add a special case into your StockSymbols dimension table called 'Private company' and add the foreign key of this special case into the fact table as your foreign key.
Almost all filtering is done through the dimension tables that are much much smaller than the fact tables. This requires having a Date dimension to be able to do date-based queries.
If you can stay in pure Star schema, then all yours JOINs are single hop (i.e. no join between two tables through another table).
All these makes JOIN operations very fast, simple and straightforward. That's why the Star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (SSAS SQL Server Analyses Services for example) which does pre-processing of the data to make it fast to query but it involves more learning than pure start-schema and it's an overkill in your case
For your example
In Star schema,
Companies will be a dimension table
You will need Month dimension table. It's simplified version of Date dimension, just for month info. An example of Date dimension is here.
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (15 things you say) will be fact tables. The facts must be numeric (b/c ideally all non-numeric values is saved in dimension tables). This means taking the non-numeric part of a fact to a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then the you will need a Currency dimension and save only the amount in the fact table and a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct list in a dimension table and add foreign key to that dimension table inside your Fact table (this is called factless fact table). The only exception to that is if the cardinality of the dimension and the fact table is very similar, then you can just store the non-numeric fact value inside the fact table directly as there is no benefit in having a dimension table (in fact a disadvantage).
Also the facts can be grouped by their granularity. For example you could have company_monthly_summary fact table and keep more than one fact in that table (which are all joining to Company dimension and Month dimension). This is all up-to-you how you would like to group facts table. But if their granularity are not the same, they should not be grouped as that will cause sparse fact tables and harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add index for your Dimension tables' most used columns
Add a numeric surrogate key to your dimension. It is usually an auto-increment number but that's up-to you. One exception people prefers for the surrogate key of Date dimension is using the format YYYYMMDD (as integer). This makes is easier on WHERE clause: i.e instead of filtering for the Date column (a DATETIME value), which will do search to find the surrogate keys, you just provide the surrogate keys directly b/c you know the format. Depending on your business domain, you may have other similar useful surrogate key patterns that you may want to consider and use. But just know, in case of a business domain change, you will have have to update all fact records. Simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be actual month number (1 for Jan)
That being said, 1 million rows in 5 years is easy to query even without a Star-schema design (with proper indexing, database maintenance). But if this is part of a larger analytics system, then go with Star schema
The simplest way.
Create a table, companyname + info you needto store + column for year-month.
Ex:
CREATE TABLE tablename (
id int(11) NOT NULL AUTO_INCREMENT,
companyname varchar(255) ,
info1 int(11) NOT NULL,
info2 datetime ,
info3 varchar(255) ,
info4 bool ,
yearmonth datetime,
PRIMARY KEY (id) );
#queries
select * from tablename where companyname="nameofthecompany";
select * from tablename where yearmonth="year-month"; #can use between here

Is it possible to implement Both SCD1 and SCD2 in a Table -SSIS

I have a table where some columns needs to be updated when the data changes & for some columns the history data needs to be preserved. So, Is it possible to have Both Type 1 Scd and Type 2 Scd in a table?

How to structure database, multiple foreign keys?

I feel like this may be a bit of a unique problem, but hopefully someone out there has come across a similar situation.
My application uses this database table:
DT table
The issue is with Field1 - 9.
Depending on how the user decides to set up their instance of the app there can be any number of fields used (from 0 - 9). The information for these are held in this Table:
Field Table
So for this example there are only to be two fields. And when a record is created for the DT table, field 1 and 2 will have data entered and all other field columns will be NULL. Obviously this isn't good practice, as for one, if a field name was changed in the future, all previous data wouldn't make sense.
I've been trying to think of a way to structure it differently. all I can think of is somehow when a DT record is created it will hold foreign keys to the fields that were used, but it seems that it's not possible to have multiple foreign keys in one column.
Any help or suggestions would be greatly appreciated.
One way to normalize this would be to factor out the repeating fields to a separate table, where you would have one entry per field with DT_id as a foreign key to the DT table.
DT Table:
ID
Start
End
...
DT_field table:
ID
DT_id (foreign key)
Value

Multiple Joins from one Dimension Table to single Fact table

I have a fact table that has 4 date columns CreatedDate, LoginDate, ActiveDate and EngagedDate. I have a dimension table called DimDate whose primary key can be used as foreign key for all the 4 date columns in fact table. So the model looks like this.
But the problem is, when I want to do sub-filtering for the measures based on the date column. For ex: Count all users who were created in the last month and are engaged in this month. This is not possible to do with this design, coz when I filter the measure with create date , I can’t further filter for a different time window for engaged date. Since all the connected to same dimension, they are not working independently.
However, If I create a separate date dimension table for each of the columns, and join them like this then it works.
But this looks very cumbersome when I have 20 different date columns in fact table in real world scenario, where I have to create 20 different dimensions and connect them one by one. Is there any other way I can achieve my scenario w/o creating multiple duplicated date dimensions?
This concept is called a role-playing dimension. You don't have to add the table to the DSV or the actual dimensions one time for each date. Instead add the date once, then go to the dimension usage tab. Click Add Cube Dimension, and then choose the date dim. Right-click and rename it. Then update the relationship to use the correct fields.
There's a good article on MSSQLTips.com that covers this topic.

how to connect generated time dimension with datetime field, ssas

I have a datetime field in my fact table, and I want to group and filter by it, so I created a time dimension and ssas generated a table for it that has a date without time primary key.
What is the right way to connect this generated table to my field, create a view and calculate there additional date with empty time? or there are any other more simple way, maybe using some hierarchies or something? Sorry for probably very simple question its just my second day with analysis servises
How do you fill your fact table. The usual way to this is in the ETL process.
You have a time dimension DIM_Date
With the columns Date_ID, Year, Month, Weekday
Then you should have a fact table FACT_Sales
With the columns Date_ID, Product_ID, Customer_ID, value
When filling the fact table with your ETL-Tool. Let's assume it is SSIS. You do a lookup in the date dimension and set the surrogate key in the facts (Date_ID) correspondingly.
If the tables are allready there. You can use a computed column in the SSAS Data Source View and set your foreign key relationship via drag and drop.