I don't get DB normalization - isn't repeating FKs also repeating? - sql

I only heard about this today and I do not get it. So I should not have a Transaction table with a Date column (because several transactions can occur on the same day); instead I should have a Transaction table and a Date table, with a FK relating the two. What is the point, then? Instead of repeating a date I will repeat a FK.
An example: a broker can make a transaction on any date (so the transaction needs to hold broker and date information).

Check out:
http://en.wikipedia.org/wiki/Database_normalization#Normal_forms
Transaction date does not need to be normalized.
But, imagine that Transaction is tied to customers, and customer details also have to be kept - this is a case where normalization helps to reduce data redundancy.

Assuming your date table is like the period table in our data warehouse, it is probably structured something like this:
Field date, datatype date (not datetime) primary key
other fields include fiscal year and holiday information
Then your transaction table might resemble something like this:
broker_id, foreign key to broker
date, foreign key to date
transaction time
other fields as necessary
Your question was, "what's the point?". This sort of database design allows you to easily answer questions like, "give me broker x's stats for the past 5 fiscal years, broken down by fiscal period"
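To make the fiscal-period point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (date_dim, broker_transaction, txn_date, amount) and the fiscal calendar values are made up for illustration; they are not taken from any real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    calendar_date TEXT PRIMARY KEY,      -- ISO date, no time part
    fiscal_year   INTEGER NOT NULL,      -- hypothetical fiscal calendar
    fiscal_period INTEGER NOT NULL,
    is_holiday    INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE broker (
    broker_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE broker_transaction (
    transaction_id   INTEGER PRIMARY KEY,
    broker_id        INTEGER NOT NULL REFERENCES broker(broker_id),
    txn_date         TEXT NOT NULL REFERENCES date_dim(calendar_date),
    transaction_time TEXT NOT NULL,
    amount           REAL NOT NULL
);
INSERT INTO date_dim VALUES
    ('2023-12-29', 2024, 3, 0),
    ('2024-01-02', 2024, 4, 0),
    ('2024-04-01', 2025, 1, 0);
INSERT INTO broker VALUES (1, 'broker x');
INSERT INTO broker_transaction VALUES
    (1, 1, '2023-12-29', '09:30', 100.0),
    (2, 1, '2024-01-02', '10:00', 50.0),
    (3, 1, '2024-01-02', '15:45', 25.0),
    (4, 1, '2024-04-01', '11:00', 10.0);
""")

# Broker stats broken down by fiscal period: the fiscal attributes come
# from the date table, so no date arithmetic is needed in the query.
stats = conn.execute("""
    SELECT d.fiscal_year, d.fiscal_period, COUNT(*), SUM(t.amount)
    FROM broker_transaction AS t
    JOIN date_dim AS d ON d.calendar_date = t.txn_date
    WHERE t.broker_id = ?
    GROUP BY d.fiscal_year, d.fiscal_period
    ORDER BY d.fiscal_year, d.fiscal_period
""", (1,)).fetchall()

for row in stats:
    print(row)
```

Note that the transaction row still stores the date itself; the date table only adds attributes that would otherwise have to be computed or hard-coded in every query.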

Normalizing scalar values (dates, numbers, etc.) is generally overkill. Just because values repeat doesn't mean they should be normalized out. Only repeating values that aren't directly related to the row's primary key (e.g. an Address) should be candidates for normalization.
The only benefit I can see to normalizing dates is if you want to add different representations of each date (e.g. Month, Quarter, etc.) without having to do the math each time. Otherwise the drawbacks outweigh the advantages, in my opinion.

Moving a date attribute into another table and making it a foreign key in the Transaction table has nothing to do with "normalization".
Consider the example relation:
T{TransactionId,Date}
and dependency
{TransactionId}->{Date}
If TransactionId is a key then T already satisfies 6th Normal Form. Moving Date into another table, replacing it with another attribute and/or making it a foreign key will not make T any "more normalized" than it is already.
Whether or not attribute values "repeat" is irrelevant in normalization. What matter are the functional dependencies and join dependencies you mean to satisfy in your database schema. "Repeating data" is a phrase sometimes used informally to describe what functional dependencies are about but it is an oversimplification to say that decomposition is required simply because data repeats.
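As a rough illustration of why repetition alone is not the issue, the following Python sketch checks whether a functional dependency holds in some sample rows. The helper fd_holds and the sample data are hypothetical, and a check against sample rows only shows that the data does not contradict a dependency; it does not prove the dependency is part of the design.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every value of `lhs` maps to a single value of `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample relation T{TransactionId, Date}; the date 2024-01-02 repeats.
T = [
    {"TransactionId": 1, "Date": "2024-01-02"},
    {"TransactionId": 2, "Date": "2024-01-02"},
    {"TransactionId": 3, "Date": "2024-01-03"},
]

id_determines_date = fd_holds(T, ["TransactionId"], ["Date"])
date_determines_id = fd_holds(T, ["Date"], ["TransactionId"])
print(id_determines_date, date_determines_id)
```

The repeated date does not break {TransactionId}->{Date}, which is the only dependency that matters for T.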

Using the date is not an ideal example. Think instead of customer records, tied to each transaction. You want to store the FK of a customer within each transaction row. You don't want to store the customer's name, address, password repeatedly though!
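A small sketch of the customer case, again with Python's sqlite3 module. The customer and sale tables and their columns are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    address     TEXT NOT NULL
);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    sale_date   TEXT NOT NULL,           -- the date stays in the row
    amount      REAL NOT NULL
);
INSERT INTO customer VALUES (1, 'Ann', '1 Old Street');
INSERT INTO sale VALUES
    (1, 1, '2024-01-02', 10.0),
    (2, 1, '2024-01-02', 20.0),
    (3, 1, '2024-01-05', 30.0);
""")

# The address lives in exactly one row, so one UPDATE fixes it everywhere.
conn.execute(
    "UPDATE customer SET address = ? WHERE customer_id = ?",
    ("2 New Road", 1),
)

sales = conn.execute("""
    SELECT s.sale_id, s.sale_date, c.address
    FROM sale AS s
    JOIN customer AS c ON c.customer_id = s.customer_id
    ORDER BY s.sale_id
""").fetchall()

for row in sales:
    print(row)
```

Had name and address been copied into every sale row, the same change would have needed one update per sale, with the risk of missing some.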

Related

When to split up tables in SQL?

I'm relatively new to SQL and trying to teach myself and I'm having a hard time understanding when to keep a column and when to separate it into a new table.
I was watching a lecture where the instructor had a 'Customer' table and one of the columns was 'City' and a lot of the customers were from the same city so the data was redundant. He then broke off 'City' into its own table but that didn't quite make sense to me.
For example, I'm creating a College Course DB and I noticed that certain columns in the 'Course' dimension are repeating very often (like credit hours). Should I break credit hours off into its own table where that table would only have a couple of rows? What does this accomplish? I would still have to use a foreign key to reference the same value for every new data entry so would it even save on storage or would it just be an unnecessary join?
I have other columns as well like 'Days of Week', 'Location', 'Class Time' which also only have a few values that repeat often. Should those be broken off into their own separate tables or be left part of the Course table?
This is always tricky when you are learning databases. The rules of normalization can help, but they can be unclear on when to apply them.
The idea is that (some) database tables represent "entities". These are things you want to store information about. Other tables represent relationships between/among entities, but let's not worry about those for now.
For your specific questions:
"credit hours". These seem more like an attribute of the course entity. When would they be their own entity? If they had other information specific to being credit hours. It is hard to come up with examples, but for instance: cost, range of effective dates, departments where the credits apply.
"days-of-weeks". If this is for a date, then just use a date and derive the day of the week using database functions for the date or a calendar table.
"days-of-weeks" for scheduling. This one is trickier. There are multiple ways to represent this; the best representation depends on how it is being used.
"location". This sounds like an entity. It could have a name, address, contact, directions and other information. In fact, there could be more than one entity to support this.
"class time". This is probably an attribute of the course, with a start time and end time.
Think about using a Course table and a schedule table. This way you can have one course with many different schedules having different times and days of the week. If there are different locations I would move the location into the schedule table. Format the times correctly so you can calculate duration and cast them if needed. It depends on what the data looks like.
course
PK course_ID (int)
credit_hours (int)
location (varchar)
|
(one to many relationship)
|
schedule
PK schedule_ID (int)
FK course_ID (int)
day_of_week (varchar)
start_time (varchar)
end_time (varchar)
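The course/schedule layout above could be tried out like this with Python's sqlite3 module. This is only a sketch: the sample rows are invented, and storing times as 'HH:MM' text is an assumption chosen so that SQLite's strftime can compute durations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course (
    course_id    INTEGER PRIMARY KEY,
    credit_hours INTEGER NOT NULL,       -- plain attribute, no lookup table
    location     TEXT NOT NULL
);
CREATE TABLE schedule (
    schedule_id INTEGER PRIMARY KEY,
    course_id   INTEGER NOT NULL REFERENCES course(course_id),
    day_of_week TEXT NOT NULL,
    start_time  TEXT NOT NULL,           -- 'HH:MM'
    end_time    TEXT NOT NULL            -- 'HH:MM'
);
INSERT INTO course VALUES (1, 3, 'Hall A'), (2, 4, 'Lab 2');
INSERT INTO schedule VALUES
    (1, 1, 'Mon', '09:00', '10:15'),
    (2, 1, 'Wed', '09:00', '10:15'),
    (3, 2, 'Fri', '13:00', '15:00');
""")

# One course, many meetings. strftime('%s', ...) converts each time to
# seconds, so the difference divided by 60 is the meeting length in minutes.
weekly = conn.execute("""
    SELECT c.course_id,
           c.credit_hours,
           COUNT(*) AS meetings,
           SUM((CAST(strftime('%s', s.end_time) AS INTEGER)
              - CAST(strftime('%s', s.start_time) AS INTEGER)) / 60) AS minutes
    FROM course AS c
    JOIN schedule AS s ON s.course_id = c.course_id
    GROUP BY c.course_id, c.credit_hours
    ORDER BY c.course_id
""").fetchall()

for row in weekly:
    print(row)
```

If a course can meet in different rooms, location would move from course to schedule, as suggested above.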

Update and delete records in the fact table

I have a fact table with five dimension tables associated with it. Typically, the fact table contains the surrogate key of each dimension and has no business or surrogate key of its own. I am trying to load the fact table with data from the staging fact table, i.e. insert new records. However, I notice the fact table may also need other operations, such as Update or Delete. A conditional split is used in the SSIS package for this purpose: if all surrogate keys are 0, the row is inserted as new. My question is: can I use the surrogate keys to drive an Update or Delete?
I made an insert on the fact table just to give an idea of what the data will look like.
The answer is yes, you can. BUT, will there be a situation where one employee sold the same product, from the same supplier, to the same customer, on the same day? Perhaps a different order on the same day? (this is based on the data you present in the question)
If all the surrogate keys together can uniquely identify a record, update fact records to your heart's content. But if that is not the case, you could end up updating records you did not intend to update.
I tend to include an order number in the fact tables I design to help avoid that situation, but you may not have that in your actual fact tables. Including the order number is a pattern referred to as a degenerate dimension in the fact table. I have found it to be pretty handy.
Anyway, the answer is the same. You can update fact records based on surrogate keys, as long as all of them together can uniquely identify the row(s) you want to update.
Don't throw caution to the wind; be sure your data warehouse is designed such that you can do this if you need to. Being able to update facts in place, rather than delete and replace, can be nice because it may mean fewer steps in the ETL process.
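One way to guard against the ambiguous case is to count the matching rows before updating. The sketch below uses Python's sqlite3 module rather than SSIS; the fact_sales table, its columns and the helper update_quantity are all hypothetical names made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sales (
    employee_key INTEGER NOT NULL,
    product_key  INTEGER NOT NULL,
    supplier_key INTEGER NOT NULL,
    customer_key INTEGER NOT NULL,
    date_key     INTEGER NOT NULL,
    order_number TEXT    NOT NULL,       -- degenerate dimension
    quantity     INTEGER NOT NULL
);
-- Same employee, product, supplier, customer and day on two orders.
INSERT INTO fact_sales VALUES
    (1, 10, 50, 100, 20240102, 'SO-1', 5),
    (1, 10, 50, 100, 20240102, 'SO-2', 7),
    (2, 11, 51, 101, 20240103, 'SO-3', 1);
""")


def update_quantity(conn, criteria, quantity):
    """Update quantity only if `criteria` matches exactly one fact row.

    Column names in `criteria` are trusted, hard-coded identifiers;
    only the values are passed as bound parameters.
    """
    where = " AND ".join(f"{column} = ?" for column in criteria)
    params = list(criteria.values())
    (matches,) = conn.execute(
        f"SELECT COUNT(*) FROM fact_sales WHERE {where}", params
    ).fetchone()
    if matches != 1:
        return False                     # ambiguous or missing: do nothing
    conn.execute(
        f"UPDATE fact_sales SET quantity = ? WHERE {where}",
        [quantity] + params,
    )
    return True


keys_only = {
    "employee_key": 1, "product_key": 10, "supplier_key": 50,
    "customer_key": 100, "date_key": 20240102,
}
updated_by_keys = update_quantity(conn, keys_only, 9)
updated_by_order = update_quantity(
    conn, {**keys_only, "order_number": "SO-2"}, 9
)
quantities = conn.execute(
    "SELECT order_number, quantity FROM fact_sales ORDER BY order_number"
).fetchall()
print(updated_by_keys, updated_by_order, quantities)
```

With the surrogate keys alone the update is refused because two rows match; adding the order number makes the target row unique.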

SSAS - Dimension and Fact tables historical data - Mapping fact table with dimension table

I have designed places-related warehouse tables: DimPlaces, FactPlaces and DimGeography. It is a straightforward design. All the locations are in DimPlaces (Addrline1, Addrline2, placename, etc.) and the geography hierarchy is in DimGeography (City, State, Country, PostCode). FactPlaces is the table that has foreign keys to DimPlaces and DimGeography.
I would like to maintain historical data as there are chances that places names or their properties might change and at the same time if the location of a place changes then geographic hierarchy key changes.
I have found design pattern -
Another useful design pattern is to add the durable account key to the fact table in addition to the dimension’s surrogate key. This joins back to the current rows in the dimension to make it easier to report all of history by the current dimension attributes.
Could you please advise whether it is OK to follow this solution? If yes, do I need to use a key of type UNIQUEIDENTIFIER for a unique value?
Another question on this - I have employees data (DimEmployee and FactEmployee). Each employee is associated with the places where he works. How to connect These EMPLOYEE TABLES with the PLACES TABLES. Do I need to connect FACTEMPLOYEE WITH FACTPLACES?
I think in the first instance, they're referring to business keys? So if your dimension table has two rows, surrogate key 1 & 2, but they both refer to the same thing, so both have AccountId/ProductId/WhateverId of 1, then you will have some fact table rows with surrogate key 1 and business key 1, and later ones with surrogate key 2 and business key 1.
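Here is a minimal sketch of that idea using Python's sqlite3 module. The table and column names (dim_places, fact_places, place_key, place_id, is_current, visits) are assumptions for illustration, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_places (
    place_key  INTEGER PRIMARY KEY,      -- surrogate key, one per version
    place_id   INTEGER NOT NULL,         -- durable (business) key
    place_name TEXT    NOT NULL,
    is_current INTEGER NOT NULL
);
CREATE TABLE fact_places (
    place_key INTEGER NOT NULL REFERENCES dim_places(place_key),
    place_id  INTEGER NOT NULL,          -- durable key carried on the fact
    visits    INTEGER NOT NULL
);
-- The same place, renamed: two dimension rows share place_id 1.
INSERT INTO dim_places VALUES
    (1, 1, 'Old Mill',  0),
    (2, 1, 'Mill Cafe', 1);
INSERT INTO fact_places VALUES
    (1, 1, 10),                          -- recorded under the old name
    (2, 1, 4);                           -- recorded under the new name
""")

# History as it was recorded: join on the surrogate key.
as_recorded = conn.execute("""
    SELECT d.place_name, SUM(f.visits)
    FROM fact_places AS f
    JOIN dim_places AS d ON d.place_key = f.place_key
    GROUP BY d.place_name
    ORDER BY d.place_name
""").fetchall()

# All of history by current attributes: join on the durable key
# and keep only the current dimension row.
by_current = conn.execute("""
    SELECT d.place_name, SUM(f.visits)
    FROM fact_places AS f
    JOIN dim_places AS d
      ON d.place_id = f.place_id AND d.is_current = 1
    GROUP BY d.place_name
""").fetchall()

print(as_recorded)
print(by_current)
```

An integer durable key is used here on purpose; nothing in the pattern requires a UNIQUEIDENTIFIER.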
Uniqueidentifiers are very wide; try to avoid using them on fact tables and for joins if possible.
For your last question - That's really more a reporting thing. Do you need to do that? Is that what people need to see, do they need to slice by that? You could consider a referenced dimension - Where the places table links to the fact tables via a placeId on the employees dimension. Or, you could have a factemployees table with start and stop dates. It depends on what you need to achieve.

database: summarizing data which expires

I'm struggling to find an efficient and flexible representation for my data. We have a many-to-many relationship between two entities which have arbitrary lifetimes. Let's call these Voter and Candidate. Each relationship has a measurement which we'd like to summarize in various ways. These are timestamped and are guaranteed to be within the lifetime of the two related entities. Let's say the measure is approval rating, or just Rating.
One unusual requirement is that if I'm summarizing a period which has no measurement, I should substitute the latest valid measurement, rather than giving NULL.
Our current solution is to compile a list of valid voters and candidates for each day, then formulate a many-to-many table which records the latest valid measure.
What would your solution be?
This allows me to do a single query to get a daily summary:
select
avg(rating), valid_date, candidate_SSN, candidate_DOB
from
daily_rating natural join rating
group by
valid_date, candidate_SSN, candidate_DOB
This might work OK, but it seems inefficient to me. We're repeating a lot of data, especially if nothing happens on a given day. It is also unclear how to do weekly/monthly summaries without compiling even more tables. Since we're dealing with millions of rows (we're not really talking about voter polls...) I'm looking for a more efficient solution.
I have used data-warehousing technique here, hence the dim and fact table names.
dimDate is so-called date dimension, one row per a date.
dimCandidate has all candidate data, new and old records. In data-warehousing terms this is called type 2 dimension. One candidate can have several rows in this table, only one of them having r_status = 'current'.
Fields
, r_valid_from date
, r_valid_to date
, r_version integer -- (1, 2, 3,..)
, r_status varchar(10) -- (expired, current)
describe a record (row) status. Each time a candidate's status changes, a new row is inserted and the previous row's r_valid_to and r_status are modified.
CandidateFullName is a business (natural) key and has to uniquely identify a candidate. No two candidates can have the same CandidateFullName. Note that the CandidateKey uniquely identifies a row in the table, while CandidateFullName uniquely identifies a candidate.
dimVoter has voter data, new and old records -- just like the dimCandidate.
dimCampaign describes campaign details, this is so-called type one dimension, does not hold historical data.
factRating has the Rating measure.
Normally this would be enough, but there is the requirement to interpolate the missing data for a day; for that, an aggregate table aggDailyRating is introduced. At the end of a day, a scheduled job aggregates ratings for the day. This job takes care of the data-interpolation requirement.
This way the aggregate table has one row for each date-(valid) candidate-campaign combination. Note that voter is not included in the combination, data is aggregated over all voters.
Any reporting is done on the aggregate table, for example
--
-- monthly rating for years 2009-2010
-- for candidate john_smith_256
--
select
CalendarYear
, MonthNumber
, avg(DailyRating) as AverageRating
from aggDailyRating as f
join dimDate as d on d.DateKey = f.DateKey
join dimCandidate as c on c.CandidateKey = f.CandidateKey
where CandidateFullName = 'john_smith_256'
and CalendarYear between 2009 and 2010
group by CalendarYear, MonthNumber
order by CalendarYear desc, MonthNumber desc ;
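The scheduled job itself is not shown above. As a rough sketch of how it might carry the latest valid measurement forward, here is a simplified version using Python's sqlite3 module. The tables dim_date and fact_rating, their columns, and the use of ISO date strings as keys are assumptions for the example; campaigns and the validity periods of voters and candidates are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key TEXT PRIMARY KEY);   -- one row per day
CREATE TABLE fact_rating (
    voter       TEXT NOT NULL,
    candidate   TEXT NOT NULL,
    rating_date TEXT NOT NULL,
    rating      REAL NOT NULL
);
INSERT INTO dim_date VALUES ('2024-01-01'), ('2024-01-02'), ('2024-01-03');
INSERT INTO fact_rating VALUES
    ('v1', 'c1', '2024-01-01', 4.0),
    ('v2', 'c1', '2024-01-02', 2.0),
    ('v1', 'c1', '2024-01-03', 5.0);
""")

# For every day and every voter/candidate pair, take the most recent rating
# on or before that day; pairs with no rating yet are dropped. The result is
# then averaged over voters, giving one row per day and candidate.
daily = conn.execute("""
    SELECT date_key, candidate, AVG(latest_rating)
    FROM (
        SELECT d.date_key  AS date_key,
               p.candidate AS candidate,
               (SELECT r.rating
                FROM fact_rating AS r
                WHERE r.voter = p.voter
                  AND r.candidate = p.candidate
                  AND r.rating_date <= d.date_key
                ORDER BY r.rating_date DESC
                LIMIT 1)   AS latest_rating
        FROM dim_date AS d
        CROSS JOIN (SELECT DISTINCT voter, candidate FROM fact_rating) AS p
    ) AS latest_per_pair
    WHERE latest_rating IS NOT NULL
    GROUP BY date_key, candidate
    ORDER BY date_key, candidate
""").fetchall()

for row in daily:
    print(row)
```

On 2024-01-02 voter v1 has no new rating, so the value from the previous day is reused in that day's average. A real job would insert these rows into the aggregate table and would probably process one day at a time rather than the whole history.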
Yes, that is very inefficient and wasteful. It is merely a set of files, not reasonably comparable to a set of "tables" or a "database"; extensions and enhancements to it will compound the duplication and inefficiency. Duplication is the antithesis of a database. In database terms, there are far more efficient and easier ways to implement that.
Assumption
Your post does not provide much info, so I have had to make some assumptions, but I think you can correct my submission quite easily if any of them are incorrect. Otherwise comment, and I will correct my submission.
A Voter is a Person; a Candidate is a Voter; (Candidate = subset of Voter)
A Campaign is related to Candidate (not to a Polling Campaign).
A Poll is a survey of the Voters' response to a Candidate's performance, starting on a set date, running over a few days, and completing on a set date.
There are many Measures, such as ApprovalRating, that are surveyed in each Poll.
The Measures of such surveys across all Voters are aggregated at the Poll level.
Limitation
The expiry requirement is unclear, so I am not suggesting I have implemented that. If the model does not provide that for you (if it is not immediately obvious), supply details and I will add to the model. The current model provides exclusion/inclusion capability for what I understand the expiry requirement to be.
The Poll::Measure does not have enough info to be implemented fully; I need further details. The submission is primitive and unconstrained in that area.
Likewise, any Poll::Campaign relation or constraint ("there are many Polls per Campaign, and they are always related to Campaign") has not been implemented.
The arrangement of the key in the child tables is arbitrary for now: if you identify the most common queries, it can be re-arranged so that those queries obtain the best speed.
Submission
Campaign Poll Data Model
This is just a Relational (Normalised; zero duplication) Database, pure IDEF1X, including provision for the consideration that the child tables will be huge: migration of narrow surrogate keys into the child tables, avoiding migration of wide keys.
It provides "data warehouse" capability as is. In fact, if it does not provide any BI or DSS requirement in a single query, that is only due to lack of detail from you; please provide, and I will happily change it. (Note, your item re "single query" is actually "single file"; joins are pedestrian in a Relational database.)
Keys such as %Code are 2-, 3-, and at most 4-characters. Such keys are just as fast as Integer keys, and very helpful (makes sense) when perusing the tables (without having to join the parent).
Any and all aggregation, either to load the historic rows, or to produce aggregates for the current values, should be possible in a single Relational (set-oriented) command; you should not need to resort to serial (cursor) processing. Again, if you think you need to, please comment and I will provide the set-oriented method.
We implement Versioning in DBs quite differently to the way it is done in DWs, and without limitations. Please identify if you require versioning of (eg) Candidate, and I will provide.
Last, the Null requirement is not unusual. It is catered for here. Again, if you think it isn't ...

How to avoid creating a date island in QlikView?

I'm a beginner developer and I have a database which has several different dates.
Created Date
Converted Date
Lost Date
Changed Date
etc.
The data needs to be shown in one application and be filterable on all dates. I am coding in QlikView, and I could create a date island and use its native set analysis to filter the data, but that is having a major impact on performance.
Anyone coding in QlikView come across a similar scenario?
Set analysis indeed has a major impact on performance. You are better off using the normal 'selection' functionality in QlikView.
For the answer below I am going to assume that you are familiar with the concept of Star Schema development. In short it means separating Dimensions (selection fields) from Fact fields (counter fields, summation fields, etc.) and connecting them via a link table.
There are two possible scenarios:
1. More than one date is related to the same fact.
For example, you have a 'sales transactions' table which has as a fact the amount of money involved in the sale, and there is not only the 'sale date' but also the 'payment date', and you want to select on both. In this case you want to have several independent date selections, since you cannot be sure whether the user wants to select on Converted date, Created date, etc. You need to duplicate your 'date island' with different key names and connect it to your transactions table twice. Both date pools will no longer be islands and are more properly called 'Calendar dimensions'.
2. Different dates are related to different facts.
In this case you can use one 'Calendar dimension' to accommodate for all date fields. Simply create one AutoNumber key in your calendar and call it %DateKey. Make this field the connection between your calendar table and your link table. Now for all Fact Tables that have a date which you want to make selectable with the calendar, make sure you connect it to the linktable using a key that includes the Date in the Autonumber hash.
Having experienced this same problem, what I would recommend is creating what I call a Key Table, like the example below. It keeps the relationships, and you don't have to use set analysis as much. Just make sure you include a table with all possible dates as one of the child tables, with a %DateKey like littlegreen suggested.