How to create history fact table? - sql

I have some entities in my Data Warehouse:
Person - with attributes personId, dateFrom, dateTo, and others those can be changed, e.g. last name, birth date and so on - slowly changing dimension
Document - documentId, number, type
Address - addressId, city, street, house, flat
The relations between (Person and Document) is One-To-Many and (Person and Address) is Many-To-Many.
My target is to create history fact table that can answer us following questions:
What persons with what documents lived at defined address on defined date?
2, What history of residents does defined address have on defined interval of time?
This is not only for what DW is designed, but I think it is the hardest thing in DW's design.
For example, Miss Brown with personId=1, documents with documentId=1 and documentId=2 had been lived at address with addressId=1 since 01/01/2005 to 02/02/2010 and then moved to addressId=2 where has been lived since 02/03/2010 to current date (NULL?). But she had changed last name to Mrs Green since 04/05/2006 and her first document with documentId=1 to documentId=3 since 06/07/2007. Mr Black with personId=2, documentId=4 has been lived at addressId=1 since 02/03/2010 to current date.
The expected result on our query for question 2 where addressId=1, and time interval is since 01/01/2000 to now, must be like:
Rows:
last_name="Brown", documentId=1, dateFrom=01/01/2005, dateTo=04/04/2006
last_name="Brown", documentId=2, dateFrom=01/01/2005, dateTo=04/04/2006
last_name="Green", documentId=1, dateFrom=04/05/2006, dateTo=06/06/2007
last_name="Green", documentId=2, dateFrom=04/05/2006, dateTo=06/06/2007
last_name="Green", documentId=2, dateFrom=06/07/2007, dateTo=02/01/2010
last_name="Green", documentId=3, dateFrom=06/07/2007, dateTo=02/01/2010
last_name="Black", documentId=4, dateFrom=02/03/2010, dateTo=NULL
I had an idea to create fact table with composite key (personId, documentId, addressId, dateFrom) but I have no idea how to load this table and then get that expected result with this structure.
I will be pleased for any help!

Interesting question #Argnist!
So to create some common language for my example, you want a
DimPerson (PK=kcPerson, suggorate key for unique Persons=kPerson, type 2 dim)
DimDocument (PK=kcDocument, suggorate key for unique Documents=kDocument, type 2 dim)
DimAddress (PK=kcAddress, suggorate key for unique Addresses=kAddress, type 2 dim)
A colleague has written a short blog on the usage of two surrogate keys to explain the above dims 'Using Two Surrogate Keys on Dimensions'.
I would always add
DimDate with PK in the form yyyymmdd
to any data warehouse with extra attribute columns.
Then you would have your fact table as
FactHistory (FKs=kcPerson, kPerson, kcDocument, kDocument, kcPerson, kPerson, kDate)
plus any aditional measures.
Then joining on the "kc"s you can show the current Person/Document/Address dimension information.
If you joined on the "k"s you can show the historic Person/Document/Address dimension information.
The downside of this is that this fact table needs one row for each person/document/address/date combination. But it really is a very narrow table, since the table just has a number of foreign keys.
The advantage of this is it is very easy to query for the sorts of questions you were asking.
Alternatively, you could have your fact table as
FactHistory (FKs=kcPerson, kPerson, kcDocument, kDocument, kcPerson, kPerson, kDateFrom, kDateTo)
plus any aditional measures.
This is obviously much more compact, but the querying becomes more complex. You could also put a view over the Fact table to make it easier to query!
The choice of solution depends on the frequency of change of the data. I suspect that it will not be changing that quickly, so teh alternate design of the fact table may be better.
Hope that helps.

Related

I need help counting char occurencies in a row with sql (using firebird server)

I have a table where I have these fields:
id(primary key, auto increment)
car registration number
car model
garage id
and 31 fields for each day of the mont for each row.
In these fields I have char of 1 or 2 characters representing car status on that date. I need to make a query to get number of each possibility for that day, field of any day could have values: D, I, R, TA, RZ, BV and LR.
I need to count in each row, amount of each value in that row.
Like how many I , how many D and so on. And this for every row in table.
What best approach would be here? Also maybe there is better way then having field in database table for each day because it makes over 30 fields obviously.
There is a better way. You should structure the data so you have another table, with rows such as:
CarId
Date
Status
Then your query would simply be:
select status, count(*)
from CarStatuses
where date >= #month_start and date < month_end
group by status;
For your data model, this is much harder to deal with. You can do something like this:
select status, count(*)
from ((select status_01 as status
from t
) union all
(select status_02
from t
) union all
. . .
(select status_31
from t
)
) s
group by status;
You seem to have to start with most basic tutorials about relational databases and SQL design. Some classic works like "Martin Gruber - Understanding SQL" may help. Or others. ATM you miss the basics.
Few hints.
Documents that you print for user or receive from user do not represent your internal data structures. They are created/parsed for that very purpose machine-to-human interface. Inside your program should structure the data for easy of storing/processing.
You have to add a "dictionary table" for the statuses.
ID / abbreviation / human-readable description
You may have a "business rule" that from "R" status you can transition to either "D" status or to "BV" status, but not to any other. In other words you better draft the possible status transitions "directed graph". You would keep it in extra columns of that dictionary table or in one more specialized helper table. Dictionary of transitions for the dictionary of possible statuses.
Your paper blank combines in the same row both totals and per-day detailisation. That is easy for human to look upon, but for computer that in a sense violates single responsibility principle. Row should either be responsible for primary record or for derived total calculation. You better have two tables - one for primary day by day records and another for per-month total summing up.
Bonus point would be that when you would change values in the primary data table you may ask server to automatically recalculate the corresponding month totals. Read about SQL triggers.
Also your triggers may check if the new state properly transits from the previous day state, as described in the "business rules". They would also maybe have to check there is not gaps between day. If there is a record for "march 03" and there is inserted a new the record for "march 05" then a record for "march 04" should exists, or the server would prohibit adding such a row. Well, maybe not, that is dependent upon you business processes. The general idea is that server should reject storing any data that is not valid and server can know it.
you per-date and per-month tables should have proper UNIQUE CONSTRAINTs prohibiting entering duplicate rows. It also means the former should have DATE-type column and the latter should either have month and year INTEGER-type columns or have a DATE-type column with the day part in it always being "1" - you would want a CHECK CONSTRAINT for it.
If your company has some registry of cars (and probably it does, it is not looking like those car were driven in by random one-time customers driving by) you have to introduce a dictionary table of cars. Integer ID (PK), registration plate, engine factory number, vagon factory number, colour and whatever else.
The per-month totals table would not have many columns per every status. It would instead have a special row for every status! The structure would probably be like that: Month / Year / ID of car in the registry / ID of status in the dictionary / count. All columns would be integer type (some may be SmallInt or BigInt, but that is minor nuancing). All the columns together (without count column) should constitute a UNIQUE CONSTRAINT or even better a "compound" Primary Key. Adding a special dedicated PK column here in the totaling table seems redundant to me.
Consequently, your per-day and per-month tables would not have literal (textual and immediate) data for status and car id. Instead they would have integer IDs referencing proper records in the corresponding cars dictionary and status dictionary tables. That you would code as FOREIGN KEY.
Remember the rule of thumb: it is easy to add/delete a row to any table but quite hard to add/delete a column.
With design like yours, column-oriented, what would happen if next year the boss would introduce some more statuses? you would have to redesign the table, the program in many points and so on.
With the rows-oriented design you would just have to add one row in the statuses dictionary and maybe few rows to transition rules dictionary, and the rest works without any change.
That way you would not

Data warehouse type 2 scd Employee dimension and HR Facts (Kimball's)

I'm following the "Data Warehouse Toolkit" book by Kimball and I'm getting confused with an example of an employee dimension and a HR snapshot fact table.
Here is a screenshot of the example given in the book:
I'm getting confused with the 'Employee count', 'New Hire Count', 'Transfer Count' and 'Promotion Count' fields. As you can see there is a relationship between the HR fact table and the Employee dimension table but which key would be assigned in the fact table in the case of these count values?. I understand that there could be a 'New Hire Count' at the end of the month and we would have Month Dimension FK in the fact table pointing to that month, but what about the employee dimension key?
I hope I'm making myself clear here, and sorry if somehow this is a dumb question.
Thanks.
stigma,
I believe the surrogate key of the most recent row from Employee Transaction Dimension is what one would include in the fact table. There may be multiple rows in Employee Transaction Dimension for the month, but only the most recent one would be referenced in the fact table row for a given month.
Hope this helps.
Best Regards,
Jesse Dyson
In a SCD (type 2 or type 3), you want to think in terms of 2 types of key; natural keys and pseudo keys. The natural key is the identifier which the "real world" would understand, in the example of an Employee dimension, this would probably be some kind of Employee Id. Each time you add an entry to this table, you get a new pseudo key, and I like to think of this as the "as-was" key. It represents the state of that dimensional member "as it was", when the record was added.
Over time in a SCD, you will have many, many records per natural key, each with it's own "as-was" key. Considering the most recent entry, it's "as-was" key is also the "as-is" key, as it represents the current state.
In a fact table, you should ALWAYS expect to find the "as-was" key. If you're going to assume the fact table will always hold the "as-is" key, or the most recent key, then it assumes you're going to go back and update historical records in your fact table simply because an attribute of the dimension changed. This is a waste of resources for started, and is actually counter-productive as one of the major benefits of a SCD is the ability to do "as-was vs as-is" analysis, and to do this you need to preserve the "as-was" state.

Why is this table not normalized?

I am taking a database course and I am studying table normalization.
Could anyone explain to me, why the second table in the first row on the right not normalized?
It is not normalized because
For a student who has signed for more than one course, the entries in the table will be:
23 Jake Smith CS101 B+
23 Jake Smith B102 C+
Clearly the data is being repeated(redundant data). It is leading to anomalies(insert, update, delete anomalies).
Ex:When you have to change the name of a Student say Jake Smith, you have to modify all of the rows,this is called an update anomalie.
Normalization is used to avoid these kind of anomalies and redundant data storage.
The table on the right hand side in the second row handles this situation in a better way, as it stores id, name and DOB in a separate table, the edits can be made easily using id attribute on a single row.
There are several normal forms like 1NF, 2NF, 3NF etc. Each normal form has some constraints associated with it. Each Higher form being stricter than the previous one.
I suppose it is table for students grades. It is not normalized because it contains students names directly, instead of references to students records.
It's better not to include student_name into this table, but store all students data in separate students table and reference it by student_id foreign key (something like first table in second row except the ids.).
It's not normalised because neither id nor student_name is the key (both have duplicates) so the key must be one of those (probably id) together with the course code. The other one (name) then doesn't depend on that key, but just on id.
The simple rule for 3NF is that every non-key column must depend on "the key, the whole key, and nothing but the key" - to which we all solemnly intone "so help me Codd"!
The higher normal forms deal with dependencies inside the parts of a key.
Because in your first right table you have twice values
23 - j.smith
that is repeated and do not adhere to Codd 1 normal form

I keep messing up 1NF

For me the most understandable description of going about 1NF so far I found is ‘A primary key is a column (or group of columns) that uniquely identifies each row. ‘ on www.phlonx.com
I understand that redundancy means per key there shouldn’t be more than 1 value to each row. More than 1 value would then be ‘redundant’. Right?
Still I manage to screw up 1 NF a lot of times.
I posted a question for my online pizzashop http://foo.com
pizzashop
here
where I was confused about something in the second normal form only to notice I started off wrong in 1 NF.
Right now I’m thinking that I need 3 keys in 1NF in order to uniquely identify each row.
In this case, I’m finding that order_id, pizza_id, and topping_id will do that for me. So that’s 3 columns. Because if you want to know which particular pizza is which you need to know what order_id it has what type of pizza (pizza_id) and what topping is on there. If you know that, you can look up all the rest.
Yet, from an answer to previous question this seems to be wrong, because topping_id goes to a different table which I don’t understand.
Here’s the list of columns:
Order_id
Order_date
Customer_id
Customer_name
Phone
Promotion
Blacklist Y or N
Customer_address
ZIP_code
City
E_mail
Pizza_id
Pizza_name
Size
Pizza_price
Amount
Topping_id
Topping_name
Topping_prijs
Availabitly
Delivery_id
Delivery_zone
Deliveryguy_id
Deliveryguy_name
Delivery Y or N
Edit: I marked the id's for the first concatenated key in bold. They are only a list of columns, unnormalized. They're not 1 table or 3 tables or anything
use Object Role Modelling (say with NORMA) to capture your information about the design, press the button and it spits out SQL.
This will be easier than having you going back and forth between 1NF, 2NF etc. An ORM design is guaranteed to be in 5NF.
Some notes:
you can have composite keys
surrogate keys may be added after both conceptual and logical design: you have added them up front which is bad. They are added because of the RDBMS performance, not at design time
have you read several sources on 1NF?
start with plain english and some facts. Which is what ORM does with verbalisation.
So:
A Customer has many pizzas (zero to n)
A pizza has many toppings (zero to n)
A customer has an address
A pizza has a base
...
I'd use some more tables for this, to remove duplication for customers, orders, toppings and pizze:
Table: Customer
Customer_id
Customer_name
Customer_name
Phone
Promotion
Blacklist Y or N
Customer_address
ZIP_code
City
E_mail
Table: Order
Order_id
Order_date
Customer_id
Delivery_zone
Deliveryguy_id
Deliveryguy_name
Delivery Y or N
Table: Order_Details
Order_ID (FK on Order)
Pizza_ID (FK on Pizza)
Amount
Table: Pizza
Pizza_id
Pizza_name
Size
Pizza_price
Table: Topping
Topping_id
Topping_name
Topping_prijs
Availabitly
Table: Pizza_Topping
Pizza_ID
Topping_ID
Pizza_topping and Order_details are so-called interselection tables ("helper" tables for modelling a m:n relationship between two tables).
Now suppose we have just one pizza, some toppings and our customer Billy Smith orders 2 quattro stagione pizze - our tables will contain this content:
Pizza(Pizza_ID, Pizza_name, Pizza_price)
1 Quattro stagioni 12€
Topping(Topping_id, topping_name, topping_price)
1 Mozzarrella 0,50€
2 Prosciutto 0,70€
3 Salami 0,50€
Pizza_Topping(Pizza_ID, Topping_ID)
1 1
1 3
(here, a quattro stagioni pizza contains only Mozzarrella and Salami).
Order(order_ID, Customer_name - rest omitted)
1 Billy Smith
Order_Details(order_id, Pizza_id, amount)
1 1 2
I've removed delivery ID, since for me, there is no distinction between an Order and a delivery - or do you support partial deliveries?
On 1NF, from wikipedia, quoting Date:
According to Date's definition of 1NF,
a table is in 1NF if and only if it is
"isomorphic to some relation", which
means, specifically, that it satisfies
the following five conditions:
There's no top-to-bottom ordering to the rows.
There's no left-to-right ordering to the columns.
There are no duplicate rows.
Every row-and-column intersection contains exactly one
value from the applicable domain (and
nothing else).
All columns are regular [i.e. rows have no hidden components such as
row IDs, object IDs, or hidden
timestamps].
—Chris Date, "What First Normal Form Really Means", pp. 127–8[4]
First two are guaranteed in any modern RDBMS.
Duplicate rows are possible in modern RDBMS - however, only if you don't have primary keys (or other unique constraints).
The fourth one is the hardest one (and depends on the semantics of your model) - for example your field Customer_address might be breaking 1NF. Might be, because if you make a contract with yourself (and any potential user of the system) that you will always look at the address as a whole and will not want to separate street name, street number and or floor, you could still claim that 1NF is not broken.
It would be more proper to break the customer address, but there are complexities there with which you would then need to address and which might bring no benefit (provided that you will never have to look a the sub-atomic part of the address line).
The fifth one is broken by some modern RDBMs, however the real importance is that your model nor system should depend on hidden elements, which is normally true - even if your RDBMS uses OIDs internally for certain operations, unless you start to use them for non-administrative, non-maintenance tasks, you can consider it not breaking the 1NF.
The strengths of relational databases come from separating information into different tables. One useful way of looking at tables is first to identify as entity tables those concepts which are relatively permanent (in your case, probably Pizza, Customer, Topping, Deliveryguy). Then you think about the relations between them (in your case, Order, Delivery ). The relational tables link together the entity tables by having foreign keys pointing to the relevant entities: an Order has foreign keys to Customer, Pizza, Topping); a Delivery has foreign keys to Deliveryguy and Order. And, yes, relations can link relations, not just entities.
Only in such a context can you achieve anything like normalization. Tossing a bunch of attributes into one singular table does not make your database relational in any meaningful sense.

Basic question: how to properly redesign this schema

I am hopping on a project that sits on top of a Sql Server 2008 DB with what seems like an inefficient schema to me. However, I'm not an expert at anything SQL, so I am seeking for guidance.
In general, the schema has tables like this:
ID | A | B
ID is a unique identifier
A contains text, such as animal names. There's very little variety; maybe 3-4 different values in thousands of rows. This could vary with time, but still a small set.
B is one of two options, but stored as text. The set is finite.
My questions are as follows:
Should I create another table for names contained in A, with an ID and a value, and set the ID as the primary key? Or should I just put an index on that column in my table? Right now, to get a list of A's, it does "select distinct(a) from table" which seems inefficient to me.
The table has a multitude of columns for properties of A. It could be like: Color, Age, Weight, etc. I would think that this is better suited in a separate table with: ID, AnimalID, Property, Value. Each property is unique to the animal, so I'm not sure how this schema could enforce this (the current schema implies this as it's a column, so you can only have one value for each property).
Right now the DB is easily readable by a human, but its size is growing fast and I feel like the design is inefficient. There currently is not index at all anywhere. As I said I'm not a pro, but will read more on the subject. The goal is to have a fast system. Thanks for your advice!
This sounds like a database that might represent a veterinary clinic.
If the table you describe represents the various patients (animals) that come to the clinic, then having properties specific to them are probably best on the primary table. But, as you say column "A" contains a species name, it might be worthwhile to link that to a secondary table to save on the redundancy of storing those names:
For example:
Patients
--------
ID Name SpeciesID Color DOB Weight
1 Spot 1 Black/White 2008-01-01 20
Species
-------
ID Species
1 Cocker Spaniel
If your main table should be instead grouped by customer or owner, then you may want to add an Animals table and link it:
Customers
---------
ID Name
1 John Q. Sample
Animals
-------
ID CustomerID SpeciesID Name Color DOB Weight
1 1 1 Spot Black/White 2008-01-01 20
...
As for your original column B, consider converting it to a boolean (BIT) if you only need to store two states. Barring that, consider CHAR to store a fixed number of characters.
Like most things, it depends.
By having the animal names directly in the table, it makes your reporting queries more efficient by removing the need for many joins.
Going with something like 3rd normal form (having an ID/Name table for the animals) makes you database smaller, but requires more joins for reporting.
Either way, make sure to add some indexes.