Data warehouse FactTable - sql

I am trying to use this to learn how about data warehousing and having trouble understanding the concept of the fact table.
http://www.codeproject.com/Articles/652108/Create-First-Data-WareHouse
What would be some queries that I could run to find information from the faceable, and what questions do they answer.

A fact table is used in the dimensional model in data warehouse design. A fact table is found at the center of a star schema or snowflake schema surrounded by dimension tables.
A fact table consists of facts of a particular business process e.g., sales revenue by month by product. Facts are also known as measurements or metrics. A fact table record captures a measurement or a metric.
Example of fact table -
In the schema below, we have a fact table FACT_SALES that has a grain which gives us a number of units sold by date, by store and by product.
All other tables such as DIM_DATE, DIM_STORE and DIM_PRODUCT are dimensions tables. This schema is known as the star schema.

Let's translate this a bit.
Firstly, in a fact table we usually enter numeric values ( rarely Strings , char's ,or other data types).
The purpose of a fact table is to connect with the KEYS of dimensional tables ,other fact tables (fact tables more rarely, and also not a good practice) AND measurements ( and by measurements I mean numbers that change frequently like Prices , Quantities , etc.).
Let's take an example:
Think about a row from a fact table as an product from a supermarket when you pass it by the check out and it get's scanned. What will be displayed in the check out row in your database fact table? Possibly:
Product_ID | ProductName | CustomerID | CustomerName | InventoryID | StoreID | StaffID | Price | Quantity ... etc.
So all those Keys and measurements are bringed together in one fact table, having some big performance and understandability advantage.

Fact Table Contain all Primary key of Dimension table and Measure like "Sales Amount"

A Fact table is a table that stores your measurements of a business process. Here you would record numeric values that apply to an event like a sale in a store. It is surrounded by dimension tables which give the measurement context (which store? which product? which date?).
Using the dimensions you can ask lots of questions of your facts, like how many of a particular product have been sold each month in a region.

Some further info
All dimension keys in a fact should be a FK to the dimension
if a key is unknown it should point to a zero key in the dimension detaialing this.
All joins from a fact to a dim are 1 to 1. bridge tables are is a technique to cater for many to many's, but this is more advanced
All measurements in a fact are numeric, but can contain NULLS if unknown (never put 0 to represent unknown)
When joining facts to dimensions there is no need to do an outer join, due to FKS applied above.
if you have 999 rows in a fact, no matter what dimensions you join to, you should always have 999 rows returned.

Related

Are multiple nullable foreign keys a bad design?

We have a table to store user financial transactions.
A transaction can be incremental or decremental, we have 3 types of transactions, increase with a payment, increase with receiving of a gift, and decrease with purchase of product, So our transaction table contains 3 foreign keys:
[PaymentId] [GiftId] [RequestId]
Is this a bad design? What better alternative is there?
I think it is complicated to join [Transaction] table with 3 other tables to get details of each transaction to display list of user transactions
It seems like you'd be better off with a column called something along the lines of TransactionType and then just storing the TransactionTypeId in a single column as well. Using these two columns you can represent any number of types of Transactions. You actually find this common in the financial data world with dimension tables that represent things like Cost Type, etc.

Store 3-dimensional table in database where 1 dimension increases over time

I have a data set with three dimensions that I would like to store for use with a website:
A list of companies (about 1000)
Information about the company (about 15 things)
Time (monthly)
Essentially, I want to track this information over time and keep it up to date.
When I start, the data will be 1000x15x1, after a year it will be 1000x15x12, and after 10 years if will be 1000x15x120.
The main queries I would make are:
Get all information for one company over all times
Get all information for one particular time
What would be a good database configuration for doing this? I'm open to either SQL or noSQL solutions.
In case it matters, the website is on Google App Engine.
From the relational database schema design perspective:
If the goal is analytics / ad-hoc querying / OLAP in general only, then you can use star-schema which is well suited for these type of analytics. But beware, OLAP databases are de-normalized and not suitable for operational transaction storage / OLTP in general, if you are planning to do both on this database.
The beauty of the Star schema:
The fact tables are usually all numeric, making the tables very small even though there are too many records. Small table means it is very fast to read (I/O).
All joins from the fact table to dimension tables are based on foreign keys (single column, numeric, indexable foreign keys)
All dimension tables have surrogate key, which is a single column primary key. Single column primary key is easier to JOIN than a multi-column primary key and also easier to index.
There is no NULL in foreign keys in fact tables. This makes JOIN operations straightforward, i.e. always JOIN fact table to all of its dimension tables. If you need NULL case, you need to add that as a special case in your dimension table. For example: if a company is not listed on stock market, and one of the thing you track is stock price, then you enter 0 or NULL into the fact for the stock price table depending on (how you want to do SUM(), AVG() etc later) and then add a special case into your StockSymbols dimension table called 'Private company' and add the foreign key of this special case into the fact table as your foreign key.
Almost all filtering is done through the dimension tables that are much much smaller than the fact tables. This requires having a Date dimension to be able to do date-based queries.
If you can stay in pure Star schema, then all yours JOINs are single hop (i.e. no join between two tables through another table).
All these makes JOIN operations very fast, simple and straightforward. That's why the Star schema is at the heart of data-warehousing designs.
https://en.wikipedia.org/wiki/Star_schema
https://en.wikipedia.org/wiki/Data_warehouse
One level up from this is OLAP (SSAS SQL Server Analyses Services for example) which does pre-processing of the data to make it fast to query but it involves more learning than pure start-schema and it's an overkill in your case
For your example
In Star schema,
Companies will be a dimension table
You will need Month dimension table. It's simplified version of Date dimension, just for month info. An example of Date dimension is here.
https://www.codeproject.com/Articles/647950/Create-and-Populate-Date-Dimension-for-Data-Wareho
The information about the company (15 things you say) will be fact tables. The facts must be numeric (b/c ideally all non-numeric values is saved in dimension tables). This means taking the non-numeric part of a fact to a dimension table. For example: if you are keeping revenue and would like to keep the currency type too, then the you will need a Currency dimension and save only the amount in the fact table and a foreign key to the Currency dimension table.
If you have any non-numeric facts, you need to store the distinct list in a dimension table and add foreign key to that dimension table inside your Fact table (this is called factless fact table). The only exception to that is if the cardinality of the dimension and the fact table is very similar, then you can just store the non-numeric fact value inside the fact table directly as there is no benefit in having a dimension table (in fact a disadvantage).
Also the facts can be grouped by their granularity. For example you could have company_monthly_summary fact table and keep more than one fact in that table (which are all joining to Company dimension and Month dimension). This is all up-to-you how you would like to group facts table. But if their granularity are not the same, they should not be grouped as that will cause sparse fact tables and harder to query.
You will use foreign keys in Fact tables to join to your Dimension tables
Add index for your Dimension tables' most used columns
Add a numeric surrogate key to your dimension. It is usually an auto-increment number but that's up-to you. One exception people prefers for the surrogate key of Date dimension is using the format YYYYMMDD (as integer). This makes is easier on WHERE clause: i.e instead of filtering for the Date column (a DATETIME value), which will do search to find the surrogate keys, you just provide the surrogate keys directly b/c you know the format. Depending on your business domain, you may have other similar useful surrogate key patterns that you may want to consider and use. But just know, in case of a business domain change, you will have have to update all fact records. Simple auto-increment surrogate key does not have that problem. In your case, the surrogate key for the month can be actual month number (1 for Jan)
That being said, 1 million rows in 5 years is easy to query even without a Star-schema design (with proper indexing, database maintenance). But if this is part of a larger analytics system, then go with Star schema
The simplest way.
Create a table, companyname + info you needto store + column for year-month.
Ex:
CREATE TABLE tablename (
id int(11) NOT NULL AUTO_INCREMENT,
companyname varchar(255) ,
info1 int(11) NOT NULL,
info2 datetime ,
info3 varchar(255) ,
info4 bool ,
yearmonth datetime,
PRIMARY KEY (id) );
#queries
select * from tablename where companyname="nameofthecompany";
select * from tablename where yearmonth="year-month"; #can use between here

Turn two database tables into one?

I am having a bit of trouble when modelling a relational database to an inventory managament system. For now, it only has 3 simple tables:
Product
ID | Name | Price
Receivings
ID | Date | Quantity | Product_ID (FK)
Sales
ID | Date | Quantity | Product_ID (FK)
As Receivings and Sales are identical, I was considering a different approach:
Product
ID | Name | Price
Receivings_Sales (the name doesn't matter)
ID | Date | Quantity | Type | Product_ID (FK)
The column type would identify if it was receiving or sale.
Can anyone help me choose the best option, pointing out the advantages and disadvantages of either approach?
The first one seems reasonable because I am thinking in a ORM way.
Thanks!
Personally I prefer the first option, that is, separate tables for Sales and Receiving.
The two biggest disadvantage in option number 2 or merging two tables into one are:
1) Inflexibility
2) Unnecessary filtering when use
First on inflexibility. If your requirements expanded (or you just simply overlooked it) then you will have to break up your schema or you will end up with unnormalized tables. For example let's say your sales would now include the Sales Clerk/Person that did the sales transaction so obviously it has nothing to do with 'Receiving'. And what if you do Retail or Wholesale sales how would you accommodate that in your merged tables? How about discounts or promos? Now, I am identifying the obvious here. Now, let's go to Receiving. What if we want to tie up our receiving to our Purchase Order? Obviously, purchase order details like P.O. Number, P.O. Date, Supplier Name etc would not be under Sales but obviously related more to Receiving.
Secondly, on unnecessary filtering when use. If you have merged tables and you want only to use the Sales (or Receving) portion of the table then you have to filter out the Receiving portion either by your back-end or your front-end program. Whereas if it a separate table you have just to deal with one table at a time.
Additionally, you mentioned ORM, the first option would best fit to that endeavour because obviously an object or entity for that matter should be distinct from other entity/object.
If the tables really are and always will be identical (and I have my doubts), then name the unified table something more generic, like "InventoryTransaction", and then use negative numbers for one of the transaction types: probably sales, since that would correctly mark your inventory in terms of keeping track of stock on hand.
The fact that headings are the same is irrelevant. Seeking to use a single table because headings are the same is misconceived.
-- person [source] loves person [target]
LOVES(source,target)
-- person [source] hates person [target]
HATES(source,target)
Every base table has a corresponding predicate aka fill-in-the-[named-]blanks statement describing the application situation. A base table holds the rows that make a true statement.
Every query expression combines base table names via JOIN, UNION, SELECT, EXCEPT, WHERE condition, etc and has a corresponding predicate that combines base table predicates via (respectively) AND, OR, EXISTS, AND NOT, AND condition, etc. A query result holds the rows that make a true statement.
Such a set of predicate-satisfying rows is a relation. There is no other reason to put rows in a table.
(The other answers here address, as they must, proposals for and consequences of the predicate that your one table could have. But if you didn't propose the table because of its predicate, why did you propose it at all? The answer is, since not for the predicate, for no good reason.)

Fact Table Recommendation

I have a data mart which only needs to capture a serial number of a product, the date of the activity, and where the activity took place (which account).
There are five possible activities. The issue I have is this. Two of the activities take place at a warehouse level. The remaining three take place at the account-level (WH does not apply). Ultimately however every warehouse rolls up to a master account.
So if I had one fact table, I would essentially need two FK and you would have to traverse the fact table to build the WH > Account hierarchy which seems hard to maintain. I'd like one dimension table.
Or is it then recommended I split this into two fact tables, even though the only different characteristic of either table is whether the activity took place at the warehouse or not.
The goal of the reporting will be at the account level, but having the WH information may be useful at some point. And I need to check for duplicates, etc which is why I was leaning towards the first, but don't know how to appropriately handle the hierarchies.
Single Fact Table Design
Item: 1
Account: 14
Warehouse:2
ActivityType:3
Date: 20130204
SerialNumber:123456
Count:1
Dual Fact Table Design
Table 1
Item: 1
Warehouse:2
ActivityType:3
Date: 20130204
SerialNumber:123456
Count:1
Table 2
Item: 1
Account:2
ActivityType:3
Date: 20130204
SerialNumber:123456
Count:1
Ive interpreted you situation as:
ALL activities require an account
Some activities involve a
warehouse.
The selection of warehouse implies an account. the
accounts mentioned in the two point above are of the same type (there
is only 1 account dimension table)
In which case you should be OK with the single FACT table design:
[ACTIVITY_FACT]
SK (Optional, i find unique surrogate PKs useful)
ITEM_SK (Link to your ITEM_DIM table)
ACCOUNT_SK (Link to your ACCOUNT_DIM table)
WAREHOUSE_SK (Link to your WAREHOUSE_DIM table, -1 for no warehouse activities)
ACTIVITY_TYPE_SK (Link to your ACTIVITY_TYPE_DIM table)
ACTIVITY_DATE_SK (Link to your DATE_DIM table)
ITEM_SERIAL_NUMBER
ITEM_COUNT
Have a record in your WAREHOUSE dimension for NONE or NOT APPLICABLE and allocate it a nice obvious special condition SK value of -1 or -9 or whatever your shop is using for such things.
For activity records that reference a warehouse, put the appropriate warehouse sk AND the account sk that belong to that warehouse.
For activities that do not involve a warehouse, populate the warehouse sk with the NONE / NOT APPLICABLE warehouse dimension record and the appropriate Account SK.
Now your fact table can be joined to your Account and Warehouse dimension tables without having to worry about outer join or null condition handling. This should allow you and your users to play about with warehouse dimension data as required and your not having to faff about with managing two tables that contain essentially the same date.
A possibility is to define the hierarchy in a single dimension table. Guessing at what you’re dealing with, I came up with the following.
Outline of dimension table:
TABLE: Account
Account_ID <surrogate key>
Account <Account name, identifier>
Warehouse (Warehouse name, identifier)
Sample data:
Account_ID Account Warehouse
1 A n/a
2 B n/a
3 C n/a
4 W wh1
5 W wh2
6 Z wh3
7 Z n/a
Account_ID is just a surrogate key, having no intrinsic meaning or value
Account lists the accounts. Here, I shows five, A, B, C, W and Z. Select distinct to get the list of accounts; join to a fact table by Account_ID where Account = “W” gets all data for that account (for however many warehouses, if applicable).
Warehouse lists all warehouses and the account they are associated with; here, “W” is the account for two separate warehouses (wh1, wh2); Z is associated with warehouse wh3, but could also be used by a fact table with “no” warehouse. Join to a fact table by Account_ID where Warehouse = “wh1” gets all data for that warehouse.
Using this, with Account_ID in a fact table you could drill down for all entries for any given Account or for a specific warehouse (or for no warehouse, if there is value in that).
There are lots of variations and permutations possible with this kind of approach.

Architecture of SQL tables

I am wondering is it more useful and practical (size of DB) to create multiple tables in sql with two columns (one column containing foreign key and one column containing random data) or merge it and create one table containing multiple columns. I am asking this because in my scenario one product holding primary key could have sufficient/applicable data for only one column while other columns would be empty.
example a. one table
productID productname weight no_of_pages
1 book 130 500
2 watch 50 null
3 ring null null
example b. three tables
productID productname
1 book
2 watch
3 ring
productID weight
1 130
2 50
productID no_of_pages
1 500
The multi-table approach is more "normal" (in database terms) because it avoids columns that commonly store NULLs. It's also something of a pain in programming terms because you have to JOIN a bunch of tables to get your original entity back.
I suggest adopting a middle way. Weight seems to be a property of most products, if not all (indeed, a ring has a weight even if small and you'll probably want to know it for shipping purposes), so I'd leave that in the Products table. But number of pages applies only to a book, as do a slew of other unmentioned properties (author, ISBN, etc). In this example, I'd use a Products table and a Books table. The books table would extend the Products table in a fashion similar to class inheritance in object oriented program.
All book-specific properties go into the Books table, and you join only Products and Books to get a complete description of a book.
I think this all depends on how the tables will be used. Maybe your examples are oversimplifying things too much but it seems to me that the first option should be good enough.
You'd really use the second example if you're going to be doing extremely CPU intensive stuff with the first table and will only need the second and third tables when more information about a product is needed.
If you're going to need the information in the second and third tables most times you query the table, then there's no reason to join over every time and you should just keep it in one table.
I would suggest example a, in case there is a defined set of attributes for product, and an example c if you need variable number of attributes (new attributes keep coming every now and then) -
example c
productID productName
1 book
2 watch
3 ring
attrID productID attrType attrValue
1 1 weight 130
2 1 no_of_pages 500
3 2 weight 50
The table structure you have shown in example b is not normalized - there will be separate id columns required in second and third tables, since productId will be an fk and not a pk.
It depends on how many rows you are expecting on your PRODUCTS table. I would say that it would not make sense to normalize your tables to 3N in this case because product name, weight, and no_of_pages each describe the products. If you had repeating data such as manufacturers, it would make more sense to normalize your tables at that point.
Without knowing the background (data model), there is no way to tell which variant is more "correct". both are fine in certain scenarios.
You want three tables, full stop. That's best because there's no chance of watches winding up with pages (no pun intended) and some books without. If you normalize, the server works for you. If you don't, you do the work instead, just not as well. Up to you.
I am asking this because in my scenario one product holding primary key could have sufficient/applicable data for only one column while other columns would be empty.
That's always true of nullable columns. Here's the rule: a nullable column has an optional relationship to the key. A nullable column can always be, and usually should be, in a separate table where it can be non-null.