I'm trying to architect a data warehouse using the star schema model. Any ideas would be appreciated.
Any idea what I should do to create a star schema? Some say that I should have a linking table with DimProjects going to the fact tables. What about project hours? What is the right approach here, or do I need other tables to link them? Employees can work on multiple projects, projects require man-hours, etc.
What is the best approach on modeling?
So far I have tables:
[CODE]
Dimension Tables    Measure Tables
----------------    --------------
DimEmployee         FactCRM
DimProjects         FactTargets
DimSalesDetails     FactRevenue
DimAccounts
DimTerritories
DimDate
DimTime
[/CODE]
Dimensions in a data warehouse schema are independent entities, for example:
Dim_Employee
Empid (PK)
Name
Address, etc.
All other dimensions follow the same pattern, and each dimension's key is linked to your fact tables. In the case above, FactCRM would include only CRM-related measures and would be linked to its specific dimensions, depending on the requirements.
Without knowing the columns, no one can tell what you actually want. Also remember that linking a dimension to a fact is already a partial star schema in itself, so that doesn't lead to any issues. The only thing is that if your dimensions are themselves normalized, the schema becomes a snowflake.
Another thing about facts: if you want to derive other facts from some existing facts, then you have to link the fact tables as well, using a unique fact ID. This is called a fact constellation. The schema would then be a star/snowflake schema with a fact constellation.
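As a minimal sketch of the layout described above: the question mentions project hours, so this assumes a hypothetical FactProjectHours table (it is not in the asker's list) with a grain of one row per employee, project and day. The column names and sample rows are made up, and Python's built-in sqlite3 is used only to keep the example runnable; the same DDL idea carries over to other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimEmployee (EmployeeKey INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE DimProjects (ProjectKey INTEGER PRIMARY KEY, ProjectName TEXT);
CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, CalendarDate TEXT);

-- Hypothetical fact table: one row per employee, project and day.
-- Each key column points at exactly one dimension (the "star").
CREATE TABLE FactProjectHours (
    EmployeeKey INTEGER NOT NULL REFERENCES DimEmployee(EmployeeKey),
    ProjectKey  INTEGER NOT NULL REFERENCES DimProjects(ProjectKey),
    DateKey     INTEGER NOT NULL REFERENCES DimDate(DateKey),
    HoursWorked REAL NOT NULL,
    PRIMARY KEY (EmployeeKey, ProjectKey, DateKey)
);

INSERT INTO DimEmployee VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO DimProjects VALUES (10, 'Website'), (20, 'Migration');
INSERT INTO DimDate VALUES (20240101, '2024-01-01');
INSERT INTO FactProjectHours VALUES
    (1, 10, 20240101, 4.0),
    (1, 20, 20240101, 3.5),
    (2, 10, 20240101, 8.0);
""")

# An employee on several projects is simply several fact rows;
# no separate linking table is needed for this question.
hours_by_project = conn.execute("""
    SELECT p.ProjectName, SUM(f.HoursWorked)
    FROM FactProjectHours AS f
    JOIN DimProjects AS p ON p.ProjectKey = f.ProjectKey
    GROUP BY p.ProjectName
    ORDER BY p.ProjectName
""").fetchall()
print(hours_by_project)
```

Under this assumed grain, the many-to-many relationship between employees and projects lives in the fact table itself.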
I need to create a bus matrix, and to do that I need to know which fact table has relationships with which dimension tables.
Unfortunately, in this new project I'm in, there seem to be no FKs (crazy, I know).
What I thought about is using the ETL queries and checking the joins between each fact table and the dimension tables.
What I'm worried about is that there might be more relationships that are not included in the ETL queries. Any advice?
You can use the system metadata tables to list the foreign key references:
select tbname, pkcolnames, reftbname, fkcolnames, colcount
from SYSIBM.SYSRELS B;
If the database does not have properly declared foreign key relationships, then the database does not have the information you are looking for.
Assuming the DB holds no information about the FKs (or information that would help you derive them, like identical column names) then, as you mentioned, examining the ETL code used to load each fact table is probably the only other way of doing this. The ETL must be running a lookup on each dimension to get the PK to insert into the fact record, so the information will be there.
There shouldn't be any relationships involving facts that you couldn't determine with this approach. There may be additional relationships between dimensions (bridge tables, more complex SCD types, etc.), but once you have sorted out the fact relationships, what remains should be a small enough subset to resolve manually (i.e. by intelligent guesswork).
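The ETL-inspection idea could be sketched roughly as follows. Everything here is an assumption for illustration: the two ETL statements are invented, the dimension tables are assumed to follow a "Dim" naming prefix, and a regular expression is a crude stand-in for a real SQL parser (it will miss lookups done outside of JOIN clauses).

```python
import re
from collections import defaultdict

# Hypothetical ETL statements, keyed by the fact table they load.
# In practice these would be read from the ETL code base.
etl_queries = {
    "FactSales": """
        INSERT INTO FactSales (DateKey, CustomerKey, Amount)
        SELECT d.DateKey, c.CustomerKey, s.Amount
        FROM StagingSales s
        JOIN DimDate d ON d.CalendarDate = s.SaleDate
        LEFT JOIN DimCustomer c ON c.CustomerCode = s.CustomerCode
    """,
    "FactReturns": """
        INSERT INTO FactReturns (DateKey, Amount)
        SELECT d.DateKey, r.Amount
        FROM StagingReturns r
        JOIN DimDate d ON d.CalendarDate = r.ReturnDate
    """,
}

# Assumption: every dimension table name starts with "Dim",
# optionally qualified with a schema name such as dbo.DimDate.
JOIN_PATTERN = re.compile(r"\bJOIN\s+((?:\w+\.)?Dim\w+)", re.IGNORECASE)


def build_bus_matrix(queries):
    """Map each fact table to the sorted dimensions its ETL joins to."""
    matrix = defaultdict(set)
    for fact, sql in queries.items():
        for dim in JOIN_PATTERN.findall(sql):
            matrix[fact].add(dim.split(".")[-1])  # drop any schema prefix
    return {fact: sorted(dims) for fact, dims in matrix.items()}


bus_matrix = build_bus_matrix(etl_queries)
for fact, dims in bus_matrix.items():
    print(fact, "->", dims)
```

The output is only a first draft of the bus matrix and would still need the manual review mentioned above.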
I'm new to dimensional modeling in data warehouse design, and I'm facing some confusion in my first design.
I took a simple business process (Vacation Request), and I want to ask which of these two designs is more accurate, effective and applicable. I would be grateful for a detailed answer. (My question is mainly about the dimension design.)
1-
Dimension.Employee      Fact.Vacation
[Employee Key]          [Employee Key] FK PK
[_source key]           [Vacation Transaction] PK DD
[Employee Name]         ...
....                    ...
[Campus Code]
[Campus Name]
[Department Code]
[Department Name]
[Section code]
[Section Name]
....
2-
Dimension.Employee      Dimension.Section       Fact.Vacation
[Employee Key]          [Section Key]           [Employee Key] FK PK
[_source key]           [_source key]           [Vacation Transaction] PK DD
[Employee Name]         [Department Code]       [Section Key] FK
....                    [Department Name]       ...
....                    [Campus Code]
                        [Campus Name]
Where the hierarchy is like this:
Campus --> contains many Departments
Department --> contains many Sections
Section --> contains many Employees
Good question! I've faced this situation a number of times myself. I'm afraid this is going to get a bit confusing and the final answer will be, "it depends", but here are some things to consider...
WHAT A STAR SCHEMA REALLY IS: While people think of a data warehouse as a reporting database, it actually performs two functions: data integration and data distribution. It turns out that the best data structures for integration are not great for distribution (here's a blog post I wrote about this some years ago). Star schemas are really about data distribution - getting data out of the system quickly and easily. The best data structures for this have no joins, i.e. they are akin to flat files (yes, I realize there are some DB buffering considerations that might affect this a bit but, in a general sense, indexed flat files do avoid all joins).
The star schema takes that flat file and normalizes it a little, largely to save disk space (it's a huge space waster when you have to write out every attribute of each dimension on every record). So, when people say a star schema is denormalized, they are partially incorrect. The dimension tables are denormalized (snowflake schemas normalize these) but the fact table is normalized - it's got a bunch of attributes dependent on a unique primary key.
So, this concept would point to minimizing the number of dimensions in order to minimize the number of joins you need to make. Point for putting them into one dimension.
FACT TABLE SHOWS RELATIONSHIPS: Your fact table shows the relationship between otherwise unrelated dimension elements. For example, in the absence of a sale, there is no relationship between a product and a customer. A sale creates that relationship and the sale fact record models it. In your case, there is a natural relationship between section and employee (I assume, at least). You don't need a fact table to model this relationship and, therefore, they should both be in one dimension table. Another point for putting them into one dimension.
CAN AN EMPLOYEE BE IN MULTIPLE SECTIONS SIMULTANEOUSLY?: If an employee can be in multiple sections at the same time then you probably do need the fact table to model this relationship (otherwise, each employee would need two active records in the employee dimension table). Point for putting them into separate dimensions.
DO EMPLOYEES CHANGE SECTIONS FREQUENTLY?: If so, and you only have one dimension, you'll end up having to constantly be modifying the employee record in your employee dimension - this leads to a longer than needed dimension table if this is a type two slowly changing dimension (i.e. one where you're tracking the history of changes to dimension elements). Point for putting them into separate dimensions.
DO YOU NEED TO AGGREGATE BY SECTION?: If you have a lot of sections and frequently report at the section level, you may need to create a fact table aggregated at the section level. In this case, if you're a staunch believer in having your DB enforce your relationships (I am), you'll need a section table to which your aggregate fact table can relate. Point for putting them into separate dimensions.
WILL THE SECTION DIMENSION BE USED BY OTHER FACT TABLES?: One tough situation with star schemas occurs when you're using conformed dimensions (i.e. dimensions that are shared by multiple fact tables). The problem occurs when the different fact tables are defined at different levels in the dimension hierarchy. In your case, imagine there is a different fact table, say one modeling equipment purchases and it only makes sense at the section, not the employee, level. In this case, you'd probably split the section dimension into its own table so it can be shared by both fact tables, your current one and that future, equipment one. This, BTW, is similar to the consideration related to aggregate tables mentioned earlier. Point for putting them into separate dimensions.
Anyhow, that's off the top of my head. So, I guess the answer is, "it depends". Both will work, it just depends on other factors you're trying to optimize for. I'll try to edit my answer if anything else comes to mind. Best of luck with this!
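To make the point about employees changing sections more concrete, here is a minimal sketch of a type two slowly changing dimension in the single-dimension design. The ValidFrom, ValidTo and IsCurrent columns, the helper function and the sample data are all assumptions for illustration, and sqlite3 is used only so that the example runs as-is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimEmployee (
    EmployeeKey  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    SourceKey    TEXT NOT NULL,                      -- key from the source system
    EmployeeName TEXT,
    SectionName  TEXT,
    ValidFrom    TEXT NOT NULL,
    ValidTo      TEXT,                               -- NULL while the row is current
    IsCurrent    INTEGER NOT NULL DEFAULT 1
);
INSERT INTO DimEmployee (SourceKey, EmployeeName, SectionName, ValidFrom)
VALUES ('E1', 'Ann', 'Payroll', '2023-01-01');
""")


def move_employee(conn, source_key, new_section, change_date):
    """Type 2 change: close the current row, then insert a new version."""
    name = conn.execute(
        "SELECT EmployeeName FROM DimEmployee "
        "WHERE SourceKey = ? AND IsCurrent = 1",
        (source_key,),
    ).fetchone()[0]
    conn.execute(
        "UPDATE DimEmployee SET ValidTo = ?, IsCurrent = 0 "
        "WHERE SourceKey = ? AND IsCurrent = 1",
        (change_date, source_key),
    )
    conn.execute(
        "INSERT INTO DimEmployee (SourceKey, EmployeeName, SectionName, ValidFrom) "
        "VALUES (?, ?, ?, ?)",
        (source_key, name, new_section, change_date),
    )


move_employee(conn, "E1", "Audit", "2024-03-01")
move_employee(conn, "E1", "Treasury", "2024-09-01")

# One employee, three dimension rows: every section change adds a row.
versions = conn.execute(
    "SELECT SectionName, IsCurrent FROM DimEmployee "
    "WHERE SourceKey = 'E1' ORDER BY EmployeeKey"
).fetchall()
print(versions)
```

With a separate section dimension, the same moves would leave the employee dimension untouched and only change which section key new fact rows carry.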
The second.
Employees are a WHO, departments are a WHERE.
I am working with a system, which has 4 databases:
Account (Storing bank accounts, transactions, etc)
Client (Client related info)
Credit (getting rates from 3rd party system)
Quality (Further internal calculation)
I want to create 4 fact tables, one for each database. For example, I will have an Account fact table with ClientAccount, Transaction and Provider as its dimension tables. I will have 3 similar fact tables for the other databases.
My question is: does it make sense to include each corresponding fact table in that database, i.e. create the Accounting fact and dimension tables in the Account database? Or is it better to create a new database for our whole star schema and put all the dimension and fact tables there?
Without knowing too much about the system, I would suggest these are dimension tables rather than fact tables.
A dimension table represents an entity or an object that you can use to construct a fact. Accounts and clients seem like a good fit for this. I'm not sure what Credit and Quality are but they may be dimensions as well.
Your fact table should represent transaction-like records. This could be sales, transactions, phone calls, or whatever your data warehouse is reporting on. This fact table would then have foreign keys to each of the dimension tables.
Regarding a single or multiple databases: I would suggest storing it in a single database. It's easier to use that way, and you don't have to worry about database links when querying your data. Your ETL process for populating these fact and dimension tables can extract the data from these four databases and load it into one database, and from there, you can build the cubes in a single database.
Unless your data volume is very small, your data warehouse should be housed in a separate database from the transactional data. A DW has a different usage pattern (OLTP vs OLAP) and will generally have a different maintenance window.
I would recommend creating all of your Dims and Facts in a single dedicated DW database. I can't think of any benefit to separating them and it would reduce your DBA overhead by not having extra databases to manage/secure/audit/document.
As for Dimensions vs Facts, data from the OLTP Account table would be used to create a Dim and a Fact. DimAccount at the very least would be a degenerate dimension containing just the account number. You'd have to review your data to determine if any of the other records are generic attributes of the Account specifically. FactAccount would contain references to the other Dimensions (DimAccountType, DimCustomer, DimLocation, etc)
Think of the dimensions as the values from lookup tables/dropdown lists, which exist prior to any events happening. For example, a bank can offer Checking & Savings accounts, even if they do not yet have any accounts.
Facts document an event. When an account is created, the fact record will reference all of the dimensions that describe the event, and record the measurable values associated with the event, if any.
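A small sketch of that distinction, under stated assumptions: DimAccountType comes from the answer above, while FactAccountOpening, its columns and the sample account are hypothetical. The dimension rows exist before any event happens; the fact row appears only when an account is actually opened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimAccountType (
    AccountTypeKey INTEGER PRIMARY KEY,
    TypeName       TEXT NOT NULL
);
-- Hypothetical fact table: one row per account-opening event.
CREATE TABLE FactAccountOpening (
    AccountNumber  TEXT PRIMARY KEY,
    AccountTypeKey INTEGER NOT NULL REFERENCES DimAccountType(AccountTypeKey),
    OpeningDeposit REAL NOT NULL  -- the measurable value of the event
);
-- The bank offers these products even though no account exists yet.
INSERT INTO DimAccountType VALUES (1, 'Checking'), (2, 'Savings');
""")


def accounts_per_type(conn):
    """Count opening events per account type, including types with none."""
    return conn.execute("""
        SELECT t.TypeName, COUNT(f.AccountNumber)
        FROM DimAccountType AS t
        LEFT JOIN FactAccountOpening AS f
               ON f.AccountTypeKey = t.AccountTypeKey
        GROUP BY t.TypeName
        ORDER BY t.TypeName
    """).fetchall()


before = accounts_per_type(conn)

# The event: a customer opens a checking account with a 250.00 deposit.
conn.execute("INSERT INTO FactAccountOpening VALUES ('ACC-001', 1, 250.0)")
after = accounts_per_type(conn)

print(before)
print(after)
```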
I am modelling a cube in SSAS. The cube has around 20 dimensions and 6 fact tables. Some of the dimensions are common among the fact tables, e.g. the Time dimension. Fact_PNL has 3 date columns, and for those we have 3 role-playing dimensions in the Dimension Usage tab. Another fact table has 5 date columns, and for them we also have separate role-playing dimensions in the Dimension Usage tab. We have a common dimension, Company, which is a foreign key in all fact tables. We might need to combine the data from multiple facts to get the final output.
Should I create separate role-playing dimensions for each of the 6 fact tables, or use the same dimension for all fact tables?
Should role-playing dimensions be created when we have multiple columns pointing to the same dimension?
It's up to you. If the role playing dimension plays the same logical role for each fact table, then I would use the same RPD for the same logical role in each fact table. But if you want to use separate ones for each fact table, maybe because you think in the future they might be used differently, then you can.
In short, either way works fine, so whatever makes the most intuitive sense to you and other users is the way you should go.
Yes, that is the purpose of role-playing dimensions: they apply when two or more columns in the same fact table reference the same dimension.
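At the relational level, a role-playing dimension is one physical dimension table joined to the fact several times under different aliases. The sketch below assumes two hypothetical date columns on Fact_PNL (TradeDateKey and SettleDateKey, since the question does not name its columns) and uses sqlite3 only to stay runnable; it does not show SSAS itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, CalendarDate TEXT);

-- Two fact columns reference the same physical dimension.
CREATE TABLE Fact_PNL (
    TradeDateKey  INTEGER REFERENCES DimDate(DateKey),
    SettleDateKey INTEGER REFERENCES DimDate(DateKey),
    Amount        REAL
);

INSERT INTO DimDate VALUES (20240102, '2024-01-02'), (20240104, '2024-01-04');
INSERT INTO Fact_PNL VALUES (20240102, 20240104, 100.0);
""")

# Each alias is one "role" played by the single DimDate table.
rows = conn.execute("""
    SELECT trade.CalendarDate, settle.CalendarDate, f.Amount
    FROM Fact_PNL AS f
    JOIN DimDate AS trade  ON trade.DateKey  = f.TradeDateKey
    JOIN DimDate AS settle ON settle.DateKey = f.SettleDateKey
""").fetchall()
print(rows)
```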
I have imported my flat files into SQL Server 2012 and created a few tables (source tables). I need to build a cube in SSAS, but it seems I need to make "dimension" and "fact" tables with proper PK/FK relations. Could someone tell me whether I need to:
1. Create empty dimABC, dimXYZ tables manually with a PK identified?
2. Copy data from the source tables (imported above) into these new dimXXX tables through some SQL query?
3. Then create a new factXXX table and copy the required facts (data) from the source tables above?
Then I need to use these tables during the cube build process.
I would appreciate your help in clarifying my steps 1, 2, 3.
You're pretty close on your steps. It sounds like you are new to data warehousing? You might want to check out The Kimball Group's Data Warehouse Toolkit or website to ensure you get your dimensions and facts built correctly.
You have your data in "staging" meaning you have imported your raw data into SQL Server. You will need to create dimension tables with surrogate keys (just auto-incremented identity values) and then create fact tables that use these surrogate keys as foreign keys. You could probably do all of this in straight SQL, but this is what SSIS is for. Once you have your facts and dimensions defined and populated, best practice is to create views to use in the DSV for your cube.
Once you have your views populated and in your DSV in SSAS, you will build the dimensions and facts and then relate them in the cube. If you define the relationships in the DSV, the relationships will be mostly populated in the Dimension usage tab for you.
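The staging-to-star steps could be sketched like this. The table and column names (StagingSales, dimCustomer, factSales) are invented for illustration, and SQLite's AUTOINCREMENT stands in for a SQL Server IDENTITY column; in a real project the same logic would typically live in SSIS packages or T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Staging: raw rows as imported from the flat file.
CREATE TABLE StagingSales (CustomerCode TEXT, Amount REAL);
INSERT INTO StagingSales VALUES ('C1', 10.0), ('C2', 5.0), ('C1', 7.5);

-- Step 1: empty dimension table with a surrogate primary key.
CREATE TABLE dimCustomer (
    CustomerKey  INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerCode TEXT NOT NULL UNIQUE  -- business key from the source
);

-- Step 2: load the distinct business keys from staging.
INSERT INTO dimCustomer (CustomerCode)
SELECT DISTINCT CustomerCode FROM StagingSales ORDER BY CustomerCode;

-- Step 3: fact table that stores the surrogate key as a foreign key.
CREATE TABLE factSales (
    CustomerKey INTEGER NOT NULL REFERENCES dimCustomer(CustomerKey),
    Amount      REAL NOT NULL
);
INSERT INTO factSales (CustomerKey, Amount)
SELECT d.CustomerKey, s.Amount
FROM StagingSales AS s
JOIN dimCustomer AS d ON d.CustomerCode = s.CustomerCode;
""")

totals = conn.execute("""
    SELECT d.CustomerCode, SUM(f.Amount)
    FROM factSales AS f
    JOIN dimCustomer AS d ON d.CustomerKey = f.CustomerKey
    GROUP BY d.CustomerCode
    ORDER BY d.CustomerCode
""").fetchall()
print(totals)
```

The lookup join in step 3 is what swaps each source business key for the dimension's surrogate key.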