Proper Way to Key Data Warehouse Fact Table


When keying a FACT table in a data warehouse, is it better to use the primary key from the foreign table or the unique key or identifier used by the business?
For example (see the illustration below), assume you have two dimension tables, "DimStores" and "DimCustomers", and one FACT table named "FactSales". Both dimension tables have an indexed integer primary key field named "ID" (StoreId and CustomerId in the tables below). They also have an indexed unique business key field, an alphanumeric text type, named "Number" (StoreNumber and CustomerNumber below).
Typically you'd use the primary key of dimension tables as the foreign keys in the FACT table. However, I'm wondering if that is the best approach.
With the primary key, looking up or doing calculations on the facts in the FACT table will almost always require a join on the primary key, with the business key as the lookup value. Most users won't know the primary key value, but they will likely know the business key, so using it means joining to the dimension table.
Since the business key is indexed anyway, would it be better to just use that as the foreign key in the FACT table? That way you wouldn't need a join and could do your lookup or calculations directly.
I guess it boils down to whether join queries are really that expensive. Imagine you're dealing with a billion-record FACT table and dimensions with tens of millions of records.
Example tables:
DimStores:
+---------+-------------+-------------+
| StoreId | StoreNumber | StoreName   |
+---------+-------------+-------------+
| 1       | S001        | Los Angeles |
| 2       | S002        | New York    |
+---------+-------------+-------------+
DimCustomers:
+------------+----------------+--------------+
| CustomerId | CustomerNumber | CustomerName |
+------------+----------------+--------------+
| 1          | S001           | Michael      |
| 2          | S002           | Kareem       |
| 3          | S003           | Larry        |
| 4          | S004           | Erving       |
+------------+----------------+--------------+
FactSales:
+---------+------------+------------+
| StoreId | CustomerId | SaleAmount |
+---------+------------+------------+
| 1       | 1          | $400       |
| 1       | 2          | $300       |
| 2       | 3          | $200       |
| 2       | 4          | $100       |
+---------+------------+------------+
In the above, to get the total sales for the Los Angeles store I'd have to do this:
Select Sum(SaleAmount)
From FactSales FT
Inner Join DimStores D1 ON FT.StoreId = D1.StoreId
Where D1.StoreNumber = 'S001'
Had I used the "StoreNumber" and "CustomerNumber" fields as the foreign keys in the "FactSales" table instead, I wouldn't have needed a join and could have done this directly:
Select Sum(SaleAmount)
From FactSales
Where StoreNumber = 'S001'

The reason you use artificial primary keys is to isolate the data warehouse from business decisions.
Your business grows. Now you have more than 1000 stores. The keys for the stores change. How do you handle this?
If the store key is spread throughout your data warehouse, this is a painful operation. If the store key is just an attribute on a dimension table, then this is easy.
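A minimal sketch of that point, using Python's built-in sqlite3 module as a stand-in for the warehouse. The move from three-digit to four-digit store numbers is a hypothetical renumbering, not something from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimStores (
        StoreId     INTEGER PRIMARY KEY,   -- surrogate key
        StoreNumber TEXT NOT NULL UNIQUE,  -- business key
        StoreName   TEXT NOT NULL
    );
    CREATE TABLE FactSales (
        StoreId    INTEGER NOT NULL REFERENCES DimStores (StoreId),
        SaleAmount INTEGER NOT NULL
    );
    INSERT INTO DimStores VALUES (1, 'S001', 'Los Angeles'), (2, 'S002', 'New York');
    INSERT INTO FactSales VALUES (1, 400), (1, 300), (2, 200), (2, 100);
""")

# Hypothetical renumbering: 'S001' becomes 'S0001', and so on.
# Only the dimension rows are updated; FactSales is never touched.
conn.execute("UPDATE DimStores SET StoreNumber = 'S0' || substr(StoreNumber, 2)")

la_total = conn.execute("""
    SELECT SUM(f.SaleAmount)
    FROM FactSales f
    INNER JOIN DimStores d ON f.StoreId = d.StoreId
    WHERE d.StoreNumber = 'S0001'
""").fetchone()[0]
print(la_total)
```

Had FactSales stored StoreNumber directly, the same renumbering would have required rewriting every affected fact row.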
I should also note that in many cases, the dimensions might be type 2 dimensions -- meaning that they change over time. For instance, customers can change their names, but you might want to know what their name was at a particular point in time.
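As a rough illustration of a type 2 dimension (again with sqlite3; the ValidFrom/ValidTo columns, the 'C001' number, and the name change are assumptions made up for this sketch), each version of a customer gets its own surrogate key, so the business key is no longer unique and could not serve as the foreign key by itself:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCustomers (
        CustomerId     INTEGER PRIMARY KEY,  -- surrogate key, one per version
        CustomerNumber TEXT NOT NULL,        -- business key, repeats across versions
        CustomerName   TEXT NOT NULL,
        ValidFrom      TEXT NOT NULL,
        ValidTo        TEXT NOT NULL         -- '9999-12-31' marks the current version
    );
    CREATE TABLE FactSales (
        CustomerId INTEGER NOT NULL REFERENCES DimCustomers (CustomerId),
        SaleDate   TEXT NOT NULL,
        SaleAmount INTEGER NOT NULL
    );
    INSERT INTO DimCustomers VALUES
        (1, 'C001', 'Lew',    '2019-01-01', '2020-06-30'),
        (5, 'C001', 'Kareem', '2020-07-01', '9999-12-31');
    INSERT INTO FactSales VALUES
        (1, '2020-03-15', 300),
        (5, '2020-09-01', 150);
""")

# Each fact row points at the dimension version that was valid at sale time,
# so the join returns the name as it was when the sale happened.
rows = conn.execute("""
    SELECT f.SaleDate, d.CustomerName, f.SaleAmount
    FROM FactSales f
    INNER JOIN DimCustomers d ON f.CustomerId = d.CustomerId
    WHERE d.CustomerNumber = 'C001'
    ORDER BY f.SaleDate
""").fetchall()
print(rows)
```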
A third reason: artificial primary keys are usually integers, which are better for indexing than strings (particularly variable-length strings). The difference in performance is minor, but it is a reason to use the primary keys. In fact, if the business keys are strings and are longer than integers, the artificial keys may also be more space-efficient.

Related

Distinct performance in Redshift

I am trying to populate multiple dimension tables from a single base table.
Sample Base Table:
| id | empl_name | emp_surname | country | dept | university |
|----|-----------|-------------|---------|------|------------|
| 1 | AAA | ZZZ | USA | CE | U_01 |
| 2 | BBB | XXX | IND | CE | U_01 |
| 3 | CCC | XXX | CAN | IT | U_02 |
| 4 | CCC | ZZZ | USA | MECH | U_01 |
Required Dimension tables :
emp_name_dim with values - AAA,BBB,CCC
emp_surname_dim with values - ZZZ,XXX
country_dim with values - USA,IND,CAN
dept_dim with values - CE,IT,MECH
university_dim with values - U_01,U_02
To populate the above dimension tables from the base table, I am thinking of 2 approaches:
1. Get the distinct combinations of all the above columns from the base table, create a single temp table from that, and use that temp table to create each individual dimension table. Here I read the base table only once, but with more column combinations.
2. Create a separate temp table of distinct values for each dimension. This way the base table is read multiple times, but each temp table is smaller (i.e. fewer rows and only a single column's distinct values).
Which approach is better in terms of performance?
Note:
The base table is huge, containing millions of rows.
The above columns are just a sample. The actual table has around 50 columns for which I need the distinct combinations.

Scanning the large table only once is the way to go.
There is also another way to get the distinct values, which in some cases is faster than distinct: perform a "group by" on all the columns. Run this as a bake-off to see which is faster. In general, if distinct produces a small number of resulting rows (small enough to fit in memory), then distinct will be faster. However, if the result is large, then group by will be faster. There are a lot of corner cases and factors (such as distribution style) that can affect this rule of thumb, so testing both is the only way to know which is faster in your case.
Given that you have 50 columns and you want all the unique combinations, I'd guess that the output set will be large and that group by will win, but this is just a guess.
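The shape of the single-scan approach can be sketched as follows. This uses Python's sqlite3 module on the sample rows purely to show the SQL and to check that both forms return the same rows; it says nothing about Redshift timings, which have to be measured on the real cluster. The table names `base` and `combos` are placeholders:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base (id INTEGER, empl_name TEXT, emp_surname TEXT,
                       country TEXT, dept TEXT, university TEXT);
    INSERT INTO base VALUES
        (1, 'AAA', 'ZZZ', 'USA', 'CE',   'U_01'),
        (2, 'BBB', 'XXX', 'IND', 'CE',   'U_01'),
        (3, 'CCC', 'XXX', 'CAN', 'IT',   'U_02'),
        (4, 'CCC', 'ZZZ', 'USA', 'MECH', 'U_01');

    -- One pass over the base table, deduplicating with GROUP BY.
    CREATE TEMP TABLE combos AS
        SELECT empl_name, emp_surname, country, dept, university
        FROM base
        GROUP BY empl_name, emp_surname, country, dept, university;

    -- Each dimension is then derived from the smaller temp table.
    CREATE TABLE country_dim AS SELECT DISTINCT country FROM combos;
    CREATE TABLE dept_dim    AS SELECT DISTINCT dept    FROM combos;
""")

# The two candidates for the bake-off return the same set of rows.
cols = "empl_name, emp_surname, country, dept, university"
via_distinct = set(conn.execute(f"SELECT DISTINCT {cols} FROM base"))
via_group_by = set(conn.execute(f"SELECT {cols} FROM base GROUP BY {cols}"))

countries = sorted(r[0] for r in conn.execute("SELECT country FROM country_dim"))
depts = sorted(r[0] for r in conn.execute("SELECT dept FROM dept_dim"))
print(via_distinct == via_group_by, countries, depts)
```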

Auto generate columns in Microsoft Access table

How can we auto-generate columns/fields in a Microsoft Access table?
Scenario:
I have a table with the personal details of my employees (EmployDetails).
I want to record their daily attendance in another table.
Rather than using a separate record for each day, I want to use a single record per employee.
E.g. I want to create a table with fields like the ones below:
EmployID, 01Jan2020, 02Jan2020, 03Jan2020, ... 25May2020 and so on
This means I would have to generate a column automatically every day.
Can anybody help me?

Generally you would define columns manually (whether that is through a UI or SQL).
With the information given I think the proper solution is to have two tables.
You have your "EmployDetails" table, where you put the general info (name, contact information, etc.) and the key, which is the employee ID (it can be autogenerated or manual; it just needs to be unique).
You would have a second table with a foreign key to the employee ID in "EmployDetails", a column called Date, and another called details (or whatever you are trying to capture in your date column idea).
Then you simply add a row for each day and do a join query between the tables to look up all the "days" for an employee. This is called normalisation, and it is how relational databases (such as Access) are designed to be used.
Employee table:
EmpID | NAME | CONTACT
-----------------------
1     | Jim  | 222-2222
2     | Jan  | 555-5555
Details table:
DetailID | EmpID (foreign key) | Date      | Hours_worked | Notes
------------------------------------------------------------------------------------
10231    | 1                   | 01Jan2020 | 5            | Lazy Jim took off early
10233    | 2                   | 02Jan2020 | 8            | Jan is a hard worker
10240    | 1                   | 02Jan2020 | 7.5          | Finally he stays a full day
To find what Jim worked you do a join (Access requires the INNER keyword):
SELECT Employee.EmpID, Employee.Name, Details.Date, Details.Hours_worked, Details.Notes
FROM Employee
INNER JOIN Details ON Employee.EmpID = Details.EmpID
WHERE Employee.Name = 'Jim';
Of course this will give you a normalised result (which is generally what's wanted so you can iterate over it):
EmpID | NAME | Date      | Hours_worked | Notes
------------------------------------------------
1     | Jim  | 01Jan2020 | 5            | ......
1     | Jim  | 02Jan2020 | 7.5          | .......
If you want the results denormalised you'll have to look into pivot tables.
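A minimal sketch of such a pivot, written with Python's sqlite3 module and conditional aggregation rather than Access itself (Access has its own crosstab query syntax, TRANSFORM ... PIVOT, for the same job). The column name `WorkDate` and the ISO date strings are assumptions chosen for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (EmpID INTEGER PRIMARY KEY, Name TEXT, Contact TEXT);
    CREATE TABLE Details (
        DetailID     INTEGER PRIMARY KEY,
        EmpID        INTEGER NOT NULL REFERENCES Employee (EmpID),
        WorkDate     TEXT NOT NULL,
        Hours_worked REAL NOT NULL
    );
    INSERT INTO Employee VALUES (1, 'Jim', '222-2222'), (2, 'Jan', '555-5555');
    INSERT INTO Details VALUES
        (10231, 1, '2020-01-01', 5),
        (10233, 2, '2020-01-02', 8),
        (10240, 1, '2020-01-02', 7.5);
""")

# The per-day columns are generated from the data at query time,
# so the stored table never needs a new column for each day.
dates = [r[0] for r in conn.execute(
    "SELECT DISTINCT WorkDate FROM Details ORDER BY WorkDate")]
day_columns = ", ".join(
    f'SUM(CASE WHEN d.WorkDate = ? THEN d.Hours_worked END) AS "{day}"'
    for day in dates)

pivot = conn.execute(f"""
    SELECT e.Name, {day_columns}
    FROM Employee e
    INNER JOIN Details d ON e.EmpID = d.EmpID
    GROUP BY e.EmpID, e.Name
    ORDER BY e.EmpID
""", dates).fetchall()
print(pivot)
```

Each employee comes back as a single row with one value per date, and a day with no record shows up as NULL (None in Python).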

How to deal with a 'self' relationship in SQL?

In our application, we have clients, and each client has a list of customers.
client table:
id | name
-------------
1 | happy
2 | bashful
customer table:
id | client_id | name
----------------------------------------
50 | 1 | happys first customer
51 | 1 | happys second customer
52 | 2 | bashfuls first customer
Without going into too much detail, each client is going to have a list of prices that apply to them. For simplicity's sake, we'll say we also have a product table with product ids 1,2 and 3, and every customer will have a unique price against each item. So customer 50 will have 3 rows, customer 51 will have 3 rows, and customer 52 will have 3 rows in this price table.
price table:
id | customer_id | product_id | price
----------------------------------------
50 | 50 | 1 | 4.99
51 | 50 | 2 | 6.20
52 | 50 | 3 | 8.00
...
Now here's the kicker: each client should also have their own rows on this price table. We'll refer to this client price list as the 'base list', because in the context of the app it's what all the customer prices will be compared against.
There are three immediately obvious solutions to me, but I'm not sure if any of them are right, or which one is optimal:
Solution 1
Add a row into the customer table where the name is something like 'self', so that 'self' can be treated almost like a client
Solution 2
Make the price table have two foreign key columns, one with customer_id and one with client_id, and allow customer_id to be null -- if customer_id is null, I know that row is the client row.
Solution 3
Have 2 price tables that are basically identical, one to foreign-key into customers and one to foreign-key into clients.

It is a good idea to declare foreign key relationships. No database that I know of supports conditional foreign key relationships, so that eliminates having one column for both clients and their customers.
You have not specified if customers are unique to clients, so let me assume that they are not.
That suggests that Options 2 and 3 are the most reasonable. There is actually little to separate them. With a single table, you want a check constraint that exactly one of the ids is set -- unless you have customers shared across clients and you are allowing client-specific, customer-specific, and customer-client specific prices.
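For the single-table variant, the "exactly one of the ids is set" rule could look roughly like this. It is a sketch using Python's sqlite3 module; the column layout of `price` and the 5.49 base price are assumptions, and the CHECK expression may need rewording for other databases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE client   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE customer (id INTEGER PRIMARY KEY,
                           client_id INTEGER NOT NULL REFERENCES client (id),
                           name TEXT NOT NULL);
    CREATE TABLE price (
        id          INTEGER PRIMARY KEY,
        client_id   INTEGER REFERENCES client (id),
        customer_id INTEGER REFERENCES customer (id),
        product_id  INTEGER NOT NULL,
        price       NUMERIC NOT NULL,
        -- exactly one of the two owner columns must be set
        CHECK ((client_id IS NULL) <> (customer_id IS NULL))
    );
    INSERT INTO client   VALUES (1, 'happy');
    INSERT INTO customer VALUES (50, 1, 'happys first customer');
    INSERT INTO price (client_id, customer_id, product_id, price) VALUES
        (1,    NULL, 1, 5.49),   -- base-list row owned by the client
        (NULL, 50,   1, 4.99);   -- customer-specific row
""")

def try_insert(client_id, customer_id):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO price (client_id, customer_id, product_id, price) "
            "VALUES (?, ?, 2, 6.20)", (client_id, customer_id))
        return True
    except sqlite3.IntegrityError:
        return False

both_set = try_insert(1, 50)          # both owners given
neither_set = try_insert(None, None)  # no owner given
print(both_set, neither_set)
```

Both attempts are rejected by the CHECK constraint, while each nullable column still carries a real foreign key.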
The more important consideration, I think, is that prices and relationships change over time. You should be thinking about how to incorporate effective and end dates into the data model to capture this information.

Postgres / SQL Databases: How to enforce unique combination of key/value Pairs

A new project requires a dynamic data model, meaning that the properties for a record are stored in a separate table like this:
Items:
ID | insertiondate
1 | 2017-01-31
Properties:
ID | fk_Item_ID | Key | Value
1 | 1 | referenceNr | 1
2 | 1 | office | O1
...
What I need now is a way to enforce that a "referenceNr" is unique per "office".
So inserting the two values (1, O2) into this table is OK, as is (2, O1), but (1, O1) has to violate the constraint.
Is there a simple way to handle this?

Even if the project really asks for some key/value entries, this doesn't seem to be true for referencenr and office as you want to apply constraints on the pair. So simply put the two in your items table and add the constraint.
The only other option I see is to make the two one entry:
ID | fk_Item_ID | Key | Value
1 | 1 | 'referenceNr/office' | '1/O1'
I'd go for the first solution. Have key/value pairs only where absolutely necessary (and where the DBMS may be oblivious as to their content and mutual relations).
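The first solution might look like the sketch below. It runs on SQLite through Python's sqlite3 module so it is self-contained; the `UNIQUE (referencenr, office)` table constraint is standard SQL and is written the same way in Postgres. The column types are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id            INTEGER PRIMARY KEY,
        insertiondate TEXT NOT NULL,
        referencenr   INTEGER NOT NULL,
        office        TEXT NOT NULL,
        UNIQUE (referencenr, office)
    );
    INSERT INTO items (insertiondate, referencenr, office)
    VALUES ('2017-01-31', 1, 'O1');
""")

def insert_item(referencenr, office):
    """Return True if the row is accepted, False if the pair already exists."""
    try:
        conn.execute(
            "INSERT INTO items (insertiondate, referencenr, office) "
            "VALUES (?, ?, ?)", ("2017-01-31", referencenr, office))
        return True
    except sqlite3.IntegrityError:
        return False

# (1, O2) and (2, O1) are new pairs; (1, O1) duplicates the existing row.
results = [insert_item(1, "O2"), insert_item(2, "O1"), insert_item(1, "O1")]
print(results)
```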

nullable foreign key columns denormalized table vs many normalized tables

In our entitlement framework, each "resource" (any entity that you want to protect by assigning privileges to roles, which can then access it or not based on those privileges) is stored in a resource table like the one below.
DESIGN1
RESOURCE TABLE
id (int) | namespace (varchar) | entity_id | black_listed (boolean)
1 | com.mycompany.core.entity1 |24 | false
2 | com.mycompany.core.entity2 |24 | false --note entity2
3 | com.mycompany.core.entity10 |3 | false -- note entity10
Each resource in the table represents a different entity, e.g. entity1, entity2, ..., entity10, so entity_id is nothing but entity1.id, entity2.id, entity3.id, and so on. Because the RESOURCE table keeps resources for all kinds of entities, the entity_id column in the RESOURCE table can't have a proper foreign key constraint. We are thinking of refactoring this schema as follows:
DESIGN 2
RESOURCE TABLE
id | description | entity1_id | entity2_id | entity3_id | entity4_id | entity5_id | entity6_id | black_listed(boolean)
1 | com.mycompany.core.entity1|24 | null | null | null |null | null
2 | com.mycompany.core.entity2|null | 24 | null | null |null | null
Now entity1_id will have a proper FK to the entity1 table, entity2_id will have a proper FK to entity2, and so on. The downside of this approach is that each row will always have null values in all of these columns but one, i.e. you can only have one entity resource per row. Having nulls also seems to be an anti-pattern, especially for FK relationships. One other way would be to normalize the schema and create a resource table for each entity type, but that would be pretty insane to maintain and would quickly become a headache. I'm not saying it's good or bad, but it doesn't look like a practical design.
Is there a better way to design such a table where proper FK relationships are also maintained? Or would you endorse Design 2?

You need to create one table for all entities, keyed either by id (a surrogate primary key) or by (entity_type, entity_id) as a unique key.
id entity_type entity_id
1 entity1 24
2 entity2 24
Then you need to have only one column in RESOURCE referring to this table (say entities). Your RESOURCE table will look like as in the first example, but the difference is there will be only one entities table, not 10.
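One possible reading of this design, sketched with Python's sqlite3 module. The column name `entity_ref` and the choice to use both a surrogate id and a unique (entity_type, entity_id) pair are assumptions; the answer does not spell out the exact DDL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE entities (
        id          INTEGER PRIMARY KEY,      -- surrogate key
        entity_type TEXT NOT NULL,
        entity_id   INTEGER NOT NULL,
        UNIQUE (entity_type, entity_id)
    );
    CREATE TABLE resource (
        id           INTEGER PRIMARY KEY,
        namespace    TEXT NOT NULL,
        entity_ref   INTEGER NOT NULL REFERENCES entities (id),
        black_listed INTEGER NOT NULL DEFAULT 0
    );
    INSERT INTO entities VALUES (1, 'entity1', 24), (2, 'entity2', 24);
    INSERT INTO resource VALUES
        (1, 'com.mycompany.core.entity1', 1, 0),
        (2, 'com.mycompany.core.entity2', 2, 0);
""")

# A resource pointing at an entity that does not exist is rejected,
# which the original entity_id column could not guarantee.
try:
    conn.execute(
        "INSERT INTO resource VALUES (3, 'com.mycompany.core.entity10', 99, 0)")
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False
print(dangling_allowed)
```

RESOURCE keeps a single non-null reference column, and the foreign key is enforced without one nullable column per entity type.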
