MDX Bridge Results (Left Join) - ssas

I'm new to data warehousing so I may be approaching this the wrong way and if so please let me know a better alternative. The following is an example using the same conceptual relationships but different names.
I have a faculty dimension and a bridge (many-to-many) table connecting faculty members to their specialties. A faculty member can have more than one specialty, or none. When I run MDX queries that pull the specialty and the faculty member, the results look fine, but I can't figure out the best way to include faculty members that have no specialty alongside the ones that do. Here is a quick snapshot of the results I want from an MDX query:
name    specialty      Salary (fact)
James   Biology        300
James   Bio-diversity  300
Henry   Mathematics    350
George  NULL           100
Louis   Linguistics    240
etc...
This is what I'm getting from my current query:
name    specialty      Salary (fact)
James   Biology        300
James   Bio-diversity  300
Henry   Mathematics    350
Louis   Linguistics    240
If I take out the bridge relationship specialty then George shows up fine. Any help or suggestions?

I would add a member named, e.g., "none" to the specialty dimension. Then I would add entries to the bridge table that reference this member for every faculty member who has no specialty.
There are several ways to implement this:
You can change your ETL process to add these records to the tables, which is the cleanest way.
You could also use views in place of the bridge and dimension tables in your Data Source View, with the views containing the logic to add these records, probably using some kind of WHERE NOT EXISTS logic.
And finally, instead of views, you could use named queries in the Data Source View that implement the same logic, just at another level.
The last two implementations have the advantage that the ETL process does not need to change.
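As a rough illustration of the view-based variant, here is a minimal sketch using Python's built-in sqlite3 module rather than SSAS. All table, view, and column names (dim_faculty, v_bridge, and so on) and the placeholder key -1 are assumptions made for this example, and salary is kept on the faculty table only to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_faculty (faculty_id INTEGER PRIMARY KEY, name TEXT, salary INTEGER);
CREATE TABLE dim_specialty (specialty_id INTEGER PRIMARY KEY, specialty TEXT);
CREATE TABLE bridge_faculty_specialty (faculty_id INTEGER, specialty_id INTEGER);

INSERT INTO dim_faculty VALUES (1,'James',300),(2,'Henry',350),(3,'George',100),(4,'Louis',240);
INSERT INTO dim_specialty VALUES (1,'Biology'),(2,'Bio-diversity'),(3,'Mathematics'),(4,'Linguistics');
INSERT INTO bridge_faculty_specialty VALUES (1,1),(1,2),(2,3),(4,4);

-- Dimension view: the real specialties plus one placeholder member (key -1 is hypothetical).
CREATE VIEW v_dim_specialty AS
SELECT specialty_id, specialty FROM dim_specialty
UNION ALL
SELECT -1, 'none';

-- Bridge view: the real rows plus one row for each faculty member without any specialty.
CREATE VIEW v_bridge AS
SELECT faculty_id, specialty_id FROM bridge_faculty_specialty
UNION ALL
SELECT f.faculty_id, -1
FROM dim_faculty AS f
WHERE NOT EXISTS (
    SELECT 1 FROM bridge_faculty_specialty AS b WHERE b.faculty_id = f.faculty_id
);
""")

# With the views in place, plain inner joins no longer drop George.
rows = conn.execute("""
SELECT f.name, s.specialty, f.salary
FROM dim_faculty AS f
JOIN v_bridge AS b ON b.faculty_id = f.faculty_id
JOIN v_dim_specialty AS s ON s.specialty_id = b.specialty_id
ORDER BY f.faculty_id, s.specialty_id
""").fetchall()

for row in rows:
    print(row)
```

The same SELECT statements could serve as named queries in the Data Source View; the ETL variant would instead insert the placeholder rows physically.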

Related

using one to many over many to many sql

I'm designing a database model where there are agents that can have customers.
From a best-practice standpoint, I'd like to know which relationship is best to use.
The thing is, a customer could be working with multiple agents. If we want to consider that the customer should be treated as if they are being worked from a different angle, is it best practice to design the model as a one-to-many relationship instead of a many-to-many?
In other words, if Agent A and Agent B are both working with John Doe, should we treat John Doe as a separate entity for each agent, even though the record of John Doe may be the same (think contact details)?
It sounds like you simply want a junction/association table with columns such as:
CustomerId
AgentId
You may also want dates, descriptions and other information describing the relationship being worked on.
Do you mean an agent_customer table? It would have an auto-increment ID as the PK plus agentID and customerID, so that one agent can have multiple customers and a customer can have multiple agents. Make sure the combination of agentID and customerID is unique to avoid redundant rows.
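A minimal sketch of such a junction table, using Python's sqlite3 module. The column names, the started_on attribute, and the sample dates are hypothetical; the point is only that John Doe is stored once and the (agent, customer) pair is unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE agent (agent_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Junction table: one row per agent/customer relationship.
CREATE TABLE agent_customer (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id INTEGER NOT NULL REFERENCES agent(agent_id),
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    started_on TEXT,                      -- hypothetical relationship attribute
    UNIQUE (agent_id, customer_id)        -- the pair is unique, not each column
);

INSERT INTO agent VALUES (1,'Agent A'),(2,'Agent B');
INSERT INTO customer VALUES (1,'John Doe');
INSERT INTO agent_customer (agent_id, customer_id, started_on)
VALUES (1,1,'2020-01-01'),(2,1,'2020-02-01');
""")

# A second row for the same pair is rejected by the UNIQUE constraint.
try:
    conn.execute("INSERT INTO agent_customer (agent_id, customer_id) VALUES (1, 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# John Doe exists once but is linked to both agents.
agents_for_john = [r[0] for r in conn.execute("""
SELECT a.name
FROM agent AS a
JOIN agent_customer AS ac ON ac.agent_id = a.agent_id
WHERE ac.customer_id = 1
ORDER BY a.agent_id
""")]
print(duplicate_rejected, agents_for_john)
```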

Handling many-to-many and one-to-many relationships between dimensions

I have a scenario where one salesperson is related to more than one department, and I need to calculate sales at both the sales rep level and the department level. Please share your thoughts on how this can be modelled.
My thought process is below.
Option 1
I would create a 'Sales Rep' dimension and a 'Department' dimension and connect them with a bridge table that has dept_id and sales_rep_id.
I want to keep history for both dimensions, so they would be SCD type 2.
Option 2
I would create a 'Sales Rep' dimension and a 'Department' dimension, and add a "sales rep id" field to the Department dimension to connect the sales rep with the department. The drawback I have observed is that department details will be repeated in the 'Department' table for each employee.
Again, I want to keep history for both dimensions, so they would be SCD type 2.
Which of the above options is better, or is there a third, better approach?
This answer has more to do with the business model than with technological needs:
Option 2 makes the most sense if the salesperson can belong to more than one department: keep the department in the "sales" fact table, and then there is no need to keep the department in the "sales person" dimension.
Option 1 makes the most sense if the salesperson belongs to only one department at a time but might change departments: make this a Slowly Changing Dimension Type 2, in which you keep the history.
A slowly changing dimension means you don't need a bridge table: the department is part of the "sales person" table. You can read more about it in the link provided.
In the odd case that a salesperson can work in several departments and have people from various departments reporting to him, the whole hierarchical model should be in a different table. In SSAS a self-referencing table doesn't work well, so look for ways to flatten those hierarchies.
Please note that when you're designing a data warehouse, the star schema means exactly that: data might repeat itself in different tables in order to make the reporting easier.
These issues never have a clear-cut solution, and I advise you to read as much as you can on data warehouse design until your head spins in order to get your head around this.
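To make the bridge-table idea from Option 1 in the question concrete, here is a small sketch in Python with sqlite3. The table names, sample reps, and amounts are invented for illustration, and SCD type 2 history columns are left out. It also shows one thing to watch for with a many-to-many bridge: a rep's sales are counted once per department, so the department totals do not add up to the overall total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_sales_rep (sales_rep_id INTEGER PRIMARY KEY, rep_name TEXT);
CREATE TABLE dim_department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE bridge_rep_department (
    sales_rep_id INTEGER REFERENCES dim_sales_rep(sales_rep_id),
    dept_id INTEGER REFERENCES dim_department(dept_id),
    PRIMARY KEY (sales_rep_id, dept_id)
);
CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, sales_rep_id INTEGER, amount INTEGER);

INSERT INTO dim_sales_rep VALUES (1,'Ann'),(2,'Bob');
INSERT INTO dim_department VALUES (10,'Hardware'),(20,'Software');
INSERT INTO bridge_rep_department VALUES (1,10),(1,20),(2,20);  -- Ann is in both departments
INSERT INTO fact_sales VALUES (1,1,100),(2,1,50),(3,2,70);
""")

# Sales at the sales rep level: a plain join from the fact to the dimension.
sales_by_rep = dict(conn.execute("""
SELECT r.rep_name, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_sales_rep AS r ON r.sales_rep_id = f.sales_rep_id
GROUP BY r.rep_name
"""))

# Sales at the department level: the fact reaches departments through the bridge.
sales_by_dept = dict(conn.execute("""
SELECT d.dept_name, SUM(f.amount)
FROM fact_sales AS f
JOIN bridge_rep_department AS b ON b.sales_rep_id = f.sales_rep_id
JOIN dim_department AS d ON d.dept_id = b.dept_id
GROUP BY d.dept_name
"""))

total_sales = conn.execute("SELECT SUM(amount) FROM fact_sales").fetchone()[0]
print(sales_by_rep, sales_by_dept, total_sales)
```

If each sale belongs to exactly one department, storing the department key on the fact row, as the answer suggests, avoids this double counting.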

Normalizing database tables

I am quite new to database design, and I am trying out a test case for tracking students.
In the image below, a student can be either in a school or in a club. For this I have created a LocationId, which acts as a global id for wherever the student is.
But the problem is that I depend on TypeId to determine whether it's a club or a school.
So in my data access query I have to branch on cases. The pseudo code is:
if TypeId == 1
    search in club for the LocationId and get the clubId
else if TypeId == 2
    search in school for the LocationId and get the schoolId
How can I get rid of these cases and still keep the design normalized?
Thanks a lot for reading. Any comments are welcome.
Good day!
This seems to be a case of table inheritance and there is more than one way to solve it. Your solution with LOC_CONTAINER doesn't work (as you have noticed) as it requires outside code to do the checking.
Take a look at this comprehensive answer about inheritance. You could for example unify SCHOOL and CLUB tables into one table called PLACE or alternatively have both SCHOOL and CLUB columns in the table STUDENT with a constraint that one of them has to be NULL.
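The second suggestion (both columns on STUDENT, with a constraint) might look roughly like the sketch below, written in Python with sqlite3. The names and sample rows are hypothetical, and the CHECK shown here is slightly stricter than the wording above: it assumes exactly one of the two columns must be set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE school (school_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE club (club_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student (
    student_id INTEGER PRIMARY KEY,
    name TEXT,
    school_id INTEGER REFERENCES school(school_id),
    club_id INTEGER REFERENCES club(club_id),
    -- Assumption: a student is in exactly one place, so exactly one column is NULL.
    CHECK ((school_id IS NULL AND club_id IS NOT NULL)
        OR (school_id IS NOT NULL AND club_id IS NULL))
);
INSERT INTO school VALUES (1,'North High');
INSERT INTO club VALUES (1,'Chess Club');
INSERT INTO student VALUES (1,'Ada',1,NULL),(2,'Ben',NULL,1);
""")

# One query resolves the location without branching on a TypeId in application code.
locations = conn.execute("""
SELECT st.name, COALESCE(sc.name, cl.name)
FROM student AS st
LEFT JOIN school AS sc ON sc.school_id = st.school_id
LEFT JOIN club AS cl ON cl.club_id = st.club_id
ORDER BY st.student_id
""").fetchall()

# A row that points at both a school and a club violates the CHECK constraint.
try:
    conn.execute("INSERT INTO student VALUES (3,'Cy',1,1)")
    both_rejected = False
except sqlite3.IntegrityError:
    both_rejected = True
print(locations, both_rejected)
```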

At what point does data normalization become ludicrous?

I often find myself questioning whether I'm taking the right approach in trying to plan for future expansibility when creating databases and relations.
I have the following situation:
I have a Donor table and a Recipient table. Both tables share common information such as first_name, last_name, email_address, date_of_birth, etc. Both seem to, if you'll excuse my object-oriented language, share a common abstract type of Person. It's possible that someone who is at one point a Recipient may later become a Donor by means of giving a donation, so it's important that information isn't duplicated across tables. Should I opt for an inheritance pattern, or should I just foreign key Donors and Recipients to a Person table?
Initially, I was thinking of simply mapping properties like email_address and street address properties directly into the things that need them, but then the possibility may arise that a person would have multiple email addresses or mailing addresses (ie: home, work, etc.). What that means is that we have a model somewhat like this:
create table person(id int primary key auto_increment, ...,
    default_email_address);
create table email_address(id int primary key auto_increment,
    email varchar(255), name varchar(255), is_default bool, person_id int);
This makes things a bit complicated, as you can imagine. The name field also involves a list of default values as well as allowing custom input. I can't just make it an enum field, because the possibility exists that someone will have a lot of emails to add that could all be different... (this is the point at which I scream out "IS IT EVEN WORTH IT ANYMORE!?!?" and get frustrated with the project)
I guess what this really boils down to is the following: at what point does data normalization become ludicrous? My goal here is to create a really good as-forward-compatible-as-possible data model that I won't kick myself for creating later.
at what point does data normalization become ludicrous?
At the point that it stops modelling the actual requirements.
To take your examples:
With the Donor and Recipient tables, if it is highly likely that any one person will become both, then it does make sense to separate out to a Person entity. If this is rare, it doesn't.
With the email_address and street_address situations, it depends whether you do need to store multiples or not (what is the expectation?). You may want to store separate versions per business unit (say shipping_address vs billing_address).
I think the problem is not in your implementation, but rather in your analysis of the problem. Donor and Recipient are not first-class actors, they are roles of the actors. If you model them as such, you'd get a somewhat cleaner model:
You'd have a person table with addresses and so on
You'd also have an address table with addresses of the people
You'd also have a person_role table, with the role code (donor, recipient) and other relevant information. You may want to get fancy, and add person_donor and person_recipient, with a foreign key into the person table.
Short answer: Normalization never becomes ridiculous. Most of what you're doing isn't normalization.
Longer answer
The "worst" (in truth, the "best") most designers can practically do is end up with all tables in 5NF. 5NF isn't ridiculous at all. (Yes, I know about 6NF. I'm ignoring it for didactic reasons.)
questioning whether I'm taking the right approach in trying to plan
for future expansibility
That's a good question to ask yourself. It has nothing to do with normalization, though. At the conceptual level, normalization is something you do after you've decided what attributes (columns) and data need to go into your database. Experienced database designers often "think in 3NF", choosing attributes, data, and normalizing all at the same time, more or less.
Should I opt for an inheritance pattern, or should I just foreign key
Donors and Recipients to a Person table?
Donors and recipients aren't different types of people. Donors are people who have made a donation. Recipients are people who have received something.
id  fullname       don_date    don_amt  recip_date  recip_amt
--
1   Jamie Hubbert  2012-01-13  $20.00
1   Jamie Hubbert  2012-02-13  $17.00
2   Kelly Hawkin   2012-01-13  $50.00
2   Kelly Hawkin   2012-01-13  $20.00
3   Neva Papke                          2012-01-13  $15.00
3   Neva Papke                          2012-02-13  $15.00
2   Kelly Hawkin                        2012-01-13  $10.00
4   Jamie Hubbert  2012-01-13  $10.00
4   Jamie Hubbert                       2012-02-13  $10.00
During normalization, you'd identify these dependencies. (For simplicity, this assumes one donation per person per date.)
person_id -> person_name
person_id -> email
person_id, donation_date -> donation_amount
person_id, recip_date -> recip_amount
Normalize to 5NF, and you'd get these three tables.
Persons
--
1 Jamie Hubbert
2 Kelly Hawkin
3 Neva Papke
4 Jamie Hubbert
Donations
--
1 2012-01-13 $20.00
1 2012-02-13 $17.00
2 2012-01-13 $50.00
2 2012-01-13 $20.00
4 2012-01-13 $10.00
Receipts (?)
--
3 2012-01-13 $15.00
3 2012-02-13 $15.00
2 2012-01-13 $10.00
4 2012-02-13 $10.00
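As a sketch of how these three tables behave in practice, the following Python/sqlite3 example loads a subset of the sample rows and finds the people who appear as both donor and recipient. Column names and the use of integer cents are assumptions for this example; the second same-day donation for person 2 is left out because it would conflict with the (person_id, donation_date) key implied by the dependencies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (person_id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
CREATE TABLE donations (
    person_id INTEGER NOT NULL REFERENCES persons(person_id),
    donation_date TEXT NOT NULL,
    amount_cents INTEGER NOT NULL,
    PRIMARY KEY (person_id, donation_date)
);
CREATE TABLE receipts (
    person_id INTEGER NOT NULL REFERENCES persons(person_id),
    recip_date TEXT NOT NULL,
    amount_cents INTEGER NOT NULL,
    PRIMARY KEY (person_id, recip_date)
);

INSERT INTO persons VALUES (1,'Jamie Hubbert'),(2,'Kelly Hawkin'),(3,'Neva Papke'),(4,'Jamie Hubbert');
INSERT INTO donations VALUES (1,'2012-01-13',2000),(1,'2012-02-13',1700),
                             (2,'2012-01-13',5000),(4,'2012-01-13',1000);
INSERT INTO receipts VALUES (3,'2012-01-13',1500),(3,'2012-02-13',1500),
                            (2,'2012-01-13',1000),(4,'2012-02-13',1000);
""")

# "Donor" and "recipient" are not stored as types; they fall out of the data.
donor_and_recipient = conn.execute("""
SELECT p.person_id, p.full_name
FROM persons AS p
WHERE EXISTS (SELECT 1 FROM donations AS d WHERE d.person_id = p.person_id)
  AND EXISTS (SELECT 1 FROM receipts AS r WHERE r.person_id = p.person_id)
ORDER BY p.person_id
""").fetchall()
print(donor_and_recipient)
```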
Initially, I was thinking of simply mapping properties like
email_address and street address properties directly into the things
that need them, but then the possibility may arise that a person would
have multiple email addresses or mailing addresses (ie: home, work,
etc.).
Deciding whether to support multiple email addresses, multiple mailing addresses, and different mailing and delivery addresses is a significant design decision. But it has nothing to do with normalization. Normalization, again, is something you do after you've decided which attributes and data belong in your database. So, if you were collecting representative sample data, you might end up with one of these two sets of email addresses.
Set A
1 Jamie Hubbert jhubbert@somedomain.com
4 Jamie Hubbert jamie.hubbert@this.com
Set B
1 Jamie Hubbert jhubbert@somedomain.com
1 Jamie Hubbert jamie@my.com
4 Jamie Hubbert jamie.hubbert@this.com
In set A, person_id->email. In set B, it doesn't. Choosing to support the data in set A or the data in set B is a big decision, and it strongly affects what you end up with after normalizing to 5NF. But deciding which set to support has nothing to do with normalization.
As an aside, choosing to assign id numbers to non-unique email addresses is another big (and questionable) design decision. Like others, this decision has nothing to do with normalization.
(Random names courtesy of The Random Name generator.)
I would put all the shared data into a Person table. The Donor and Recipient tables should only contain data that are specific to each, and should have foreign keys pointing back to the primary key of Person.
This isn't ludicrous normalization at all; it's actually pretty common practice.

Normalization in plain English

I understand the concept of database normalization, but always have a hard time explaining it in plain English - especially for a job interview. I have read the wikipedia post, but still find it hard to explain the concept to non-developers. "Design a database in a way not to get duplicated data" is the first thing that comes to mind.
Does anyone have a nice way to explain the concept of database normalization in plain English? And what are some nice examples to show the differences between first, second and third normal forms?
Say you go to a job interview and the person asks: Explain the concept of normalization and how you would go about designing a normalized database.
What key points are the interviewers looking for?
Well, if I had to explain it to my wife, it would be something like this:
The main idea is to avoid duplication of large data.
Let's take a look at a list of people and the country they came from. Instead of holding the name of the country, which can be as long as "Bosnia & Herzegovina", for every person, we simply hold a number that references a table of countries. So instead of holding 100 "Bosnia & Herzegovina"s, we hold 100 #45. Now if in the future, as often happens with Balkan countries, it splits into two countries, Bosnia and Herzegovina, I will have to change it in only one place. Well, sort of.
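The country lookup can be sketched as follows in Python with sqlite3. Table names and the people are invented; the id 45 comes from the example above. For simplicity the sketch models a rename rather than a real split, which would also require reassigning people to the new countries (the "sort of").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country (country_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    country_id INTEGER NOT NULL REFERENCES country(country_id)
);
INSERT INTO country VALUES (45,'Bosnia & Herzegovina');
INSERT INTO person VALUES (1,'Amra',45),(2,'Emir',45),(3,'Lejla',45);
""")

# The country name is stored once, so a rename touches exactly one row.
cur = conn.execute("UPDATE country SET name = 'Bosnia and Herzegovina' WHERE country_id = 45")
rows_changed = cur.rowcount

# Every person sees the new name through the join.
names_seen = {r[0] for r in conn.execute("""
SELECT c.name FROM person AS p JOIN country AS c ON c.country_id = p.country_id
""")}
print(rows_changed, names_seen)
```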
Now, to explain 2NF, I would have changed the example, and let's assume that we hold the list of countries every person visited.
Instead of holding a table like:
Person  CountryVisited  AnotherInformation  D.O.B.
Faruz   USA             Blah Blah           1/1/2000
Faruz   Canada          Blah Blah           1/1/2000
I would have created three tables, one table with the list of countries, one table with the list of persons and another table to connect them both. That gives me the most freedom I can get changing person's information or country information. This enables me to "remove duplicate rows" as normalization expects.
One-to-many relationships should be represented as two separate tables connected by a foreign key. If you try to shove a logical one-to-many relationship into a single table, then you are violating normalization which leads to dangerous problems.
Say you have a database of your friends and their cats. Since a person may have more than one cat, we have a one-to-many relationship between persons and cats. This calls for two tables:
Friends
Id | Name | Address
-------------------------
1 | John | The Road 1
2 | Bob | The Belltower
Cats
Id | Name | OwnerId
---------------------
1 | Kitty | 1
2 | Edgar | 2
3 | Howard | 2
(Cats.OwnerId is a foreign key to Friends.Id)
The above design is fully normalized and conforms to all known normalization levels.
But say I had tried to represent the above information in a single table like this:
Friends and cats
Id | Name | Address | CatName
-----------------------------------
1 | John | The Road 1 | Kitty
2 | Bob | The Belltower | Edgar
3 | Bob | The Belltower | Howard
(This is the kind of design I might have made if I was used to Excel-sheets but not relational databases.)
A single-table approach forces me to repeat some information if I want the data to be consistent. The problem with this design is that some facts, like the information that Bob's address is "The Belltower", are repeated, which is redundant, makes it difficult to query and change data, and (worst of all) makes it possible to introduce logical inconsistencies.
E.g. if Bob moves, I have to make sure I change the address in both rows. If Bob gets another cat, I have to be sure to repeat the name and address exactly as typed in the other two rows. If I make a typo in Bob's address in one of the rows, suddenly the database has inconsistent information about where Bob lives. The un-normalized database cannot prevent the introduction of inconsistent and self-contradictory data, and hence the database is not reliable. This is clearly not acceptable.
Normalization cannot prevent you from entering wrong data. What normalization prevents is the possibility of inconsistent data.
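The difference can be demonstrated with a short Python/sqlite3 sketch that loads both designs from the tables above and then lets Bob move. The new address 'New Street 5' is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Single-table design: Bob's address is stored once per cat.
CREATE TABLE friends_and_cats (id INTEGER PRIMARY KEY, name TEXT, address TEXT, cat_name TEXT);
INSERT INTO friends_and_cats VALUES
    (1,'John','The Road 1','Kitty'),
    (2,'Bob','The Belltower','Edgar'),
    (3,'Bob','The Belltower','Howard');

-- Normalized design: the address is stored once per friend.
CREATE TABLE friends (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE cats (id INTEGER PRIMARY KEY, name TEXT, owner_id INTEGER REFERENCES friends(id));
INSERT INTO friends VALUES (1,'John','The Road 1'),(2,'Bob','The Belltower');
INSERT INTO cats VALUES (1,'Kitty',1),(2,'Edgar',2),(3,'Howard',2);
""")

# Bob moves, but only one of his two rows gets updated: nothing stops this.
conn.execute("UPDATE friends_and_cats SET address = 'New Street 5' WHERE id = 2")
single_table_addresses = [r[0] for r in conn.execute(
    "SELECT DISTINCT address FROM friends_and_cats WHERE name = 'Bob' ORDER BY address")]

# In the normalized design there is only one row to update, so no contradiction is possible.
conn.execute("UPDATE friends SET address = 'New Street 5' WHERE id = 2")
normalized_addresses = [r[0] for r in conn.execute("""
SELECT DISTINCT f.address
FROM cats AS c JOIN friends AS f ON f.id = c.owner_id
WHERE f.name = 'Bob'
""")]
print(single_table_addresses, normalized_addresses)
```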
It is important to note that normalization depends on business decisions. If you have a customer database, and you decide to only record a single address per customer, then the table design (#CustomerID, CustomerName, CustomerAddress) is fine. If however you decide that you allow each customer to register more than one address, then the same table design is not normalized, because you now have a one-to-many relationship between customer and address. Therefore you cannot just look at a database to determine if it is normalized, you have to understand the business model behind the database.
This is what I ask interviewees:
Why don't we use a single table for an application instead of multiple tables?
The answer is, of course, normalization. As already said, it's to avoid redundancy and thereby update anomalies.
This is not a thorough explanation, but one goal of normalization is to allow for growth without awkwardness.
For example, if you've got a user table, and every user is going to have one and only one phone number, it's fine to have a phonenumber column in that table.
However, if each user is going to have a variable number of phone numbers, it would be awkward to have columns like phonenumber1, phonenumber2, etc. This is for two reasons:
If your columns go up to phonenumber3 and someone needs to add a fourth number, you have to add a column to the table.
For all the users with fewer than 3 phone numbers, there are empty columns on their rows.
Instead, you'd want to have a phonenumber table, where each row contains a phone number and a foreign key reference to which row in the user table it belongs to. No blank columns are needed, and each user can have as few or many phone numbers as necessary.
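A minimal sketch of that phonenumber table in Python with sqlite3; the app_user table name and the sample numbers are hypothetical. One user has four numbers and another has none, and neither case needs a schema change or blank columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_user (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE phonenumber (
    phonenumber_id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES app_user(user_id),
    number TEXT NOT NULL
);
INSERT INTO app_user VALUES (1,'Ann'),(2,'Bob');
-- Ann has four numbers; Bob has none and simply has no rows here.
INSERT INTO phonenumber (user_id, number)
VALUES (1,'555-0100'),(1,'555-0101'),(1,'555-0102'),(1,'555-0103');
""")

counts = conn.execute("""
SELECT u.name, COUNT(p.number)
FROM app_user AS u
LEFT JOIN phonenumber AS p ON p.user_id = u.user_id
GROUP BY u.user_id, u.name
ORDER BY u.user_id
""").fetchall()
print(counts)
```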
One side point to note about normalization: A fully normalized database is space efficient, but is not necessarily the most time efficient arrangement of data depending on use patterns.
Skipping around to multiple tables to look up all the pieces of info from their normalized locations takes time. In high load situations (millions of rows per second flying around, thousands of concurrent clients, like say credit card transaction processing) where time is more valuable than storage space, appropriately denormalized tables can give better response times than fully normalized tables.
For more info on this, look for SQL books written by Ken Henderson.
I would say that normalization is like keeping notes to do things efficiently, so to speak:
If you had a note that said you had to
go shopping for ice cream without
normalization, you would then have
another note, saying you have to go
shopping for ice cream, just one in
each pocket.
Now, In real life, you would never do
this, so why do it in a database?
For the designing and implementing part, that's when you can move back to "the lingo" and away from layman's terms, but I suppose you could simplify. You would say what you needed to at first, and then when normalization comes into it, you say you'll make sure of the following:
There must be no repeating groups of information within a table
No table should contain data that is not functionally dependent on that table's primary key
For 3NF I like Bill Kent's take on it: Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key.
I think it may be more impressive if you speak of denormalization as well, and the fact that you cannot always have the best structure AND be in normal forms.
Normalization is a set of rules used to design tables that are connected through relationships.
It helps avoid repetitive entries, reduce the required storage space, prevent the need to restructure existing tables to accommodate new data, and increase the speed of queries.
First Normal Form: Data should be broken up into the smallest units. Tables should not contain repeating groups of columns. Each row is identified by a primary key of one or more columns.
For example, if there is a column named 'Name' in a 'Customer' table, it should be broken into 'First Name' and 'Last Name'. Also, 'Customer' should have a column named 'CustomerID' to identify a particular customer.
Second Normal Form: Each non-key column should be directly related to the entire primary key.
For example, if a 'Customer' table has a column named 'City', the city should have a separate table with a primary key and the city name defined; in the 'Customer' table, replace the 'City' column with 'CityID' and make 'CityID' a foreign key in the table.
Third Normal Form: Each non-key column should not depend on other non-key columns.
For example, in an order table, the column 'Total' is dependent on 'Unit price' and 'quantity', so the 'Total' column should be removed.
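The 'Total' example can be sketched like this in Python with sqlite3: the total is not stored but derived when queried, so it cannot drift out of sync with price and quantity. The order_line table, its columns, and the values (in integer cents) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- No stored 'total' column: it would depend on two other non-key columns.
CREATE TABLE order_line (
    line_id INTEGER PRIMARY KEY,
    product TEXT NOT NULL,
    unit_price_cents INTEGER NOT NULL,
    quantity INTEGER NOT NULL
);
INSERT INTO order_line VALUES (1,'Widget',250,4),(2,'Gadget',1000,2);
""")

def line_totals(connection):
    """Derive each line's total from unit price and quantity at query time."""
    return connection.execute("""
        SELECT product, unit_price_cents * quantity AS total_cents
        FROM order_line
        ORDER BY line_id
    """).fetchall()

before = line_totals(conn)
# Changing the quantity needs no second update to keep a total in step.
conn.execute("UPDATE order_line SET quantity = 5 WHERE line_id = 1")
after = line_totals(conn)
print(before, after)
```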
I teach normalization in my Access courses and break it down a few ways.
After discussing the precursors to storyboarding or planning out the database, I then delve into normalization. I explain the rules like this:
Each field should contain the smallest meaningful value:
I write a name field on the board and then place a first name and last name in it like Bill Lumbergh. We then query the students and ask them what we will have problems with, when the first name and last name are all in one field. I use my name as an example, which is Jim Richards. If the students do not lead me down the road, then I yank their hand and take them with me. :) I tell them that my name is a tough name for some, because I have what some people would consider 2 first names and some people call me Richard. If you were trying to search for my last name then it is going to be harder for a normal person (without wildcards), because my last name is buried at the end of the field. I also tell them that they will have problems with easily sorting the field by last name, because again my last name is buried at the end.
I then let them know that meaningful is based upon the audience who is going to be using the database as well. We, at our job will not need a separate field for apartment or suite number if we are storing people's addresses, but shipping companies like UPS or FEDEX might need it separated out to easily pull up the apartment or suite of where they need to go when they are on the road and running from delivery to delivery. So it is not meaningful to us, but it is definitely meaningful to them.
Avoiding Blanks:
I use an analogy to explain to them why they should avoid blanks. I tell them that Access and most databases do not store blanks like Excel does. Excel does not care if you have nothing typed out in the cell and will not increase the file size, but Access will reserve that space until that point in time that you will actually use the field. So even if it is blank, then it will still be using up space and explain to them that it also slows their searches down as well.
The analogy I use is empty shoe boxes in the closet. If you have shoe boxes in the closet and you are looking for a pair of shoes, you will need to open up and look in each of the boxes for a pair of shoes. If there are empty shoe boxes, then you are just wasting space in the closet and also wasting time when you need to look through them for that certain pair of shoes.
Avoiding redundancy in data:
I show them a table that has lots of repeated values for customer information and then tell them that we want to avoid duplicates, because I have sausage fingers and will mistype values if I have to type in the same thing over and over again. This “fat-fingering” of data will lead to my queries not finding the correct data. Instead, we will break the data out into a separate table and create a relationship using a primary and foreign key field. This way we save space, because we are not typing the customer's name, address, etc. multiple times and instead are just using the customer's ID number in a field for the customer. We then discuss drop-down lists/combo boxes/lookup lists or whatever else Microsoft wants to name them later on. :) You as a user will not want to look up and type out the customer's number each time in that customer field, so we will set up a drop-down list that gives you a list of customers, where you can select their name and it will fill in the customer’s ID for you. This is a 1-to-many relationship, in which 1 customer has many different orders.
Avoiding repeated groups of fields:
I demonstrate this when talking about many-to-many relationships. First, I draw 2 tables, 1 that will hold employee information and 1 that will hold project information. The tables are laid similar to this.
(Table1)
tblEmployees
* EmployeeID
First
Last
(Other Fields)….
Project1
Project2
Project3
Etc.
**********************************
(Table2)
tblProjects
* ProjectNum
ProjectName
StartDate
EndDate
…..
I explain to them that this would not be a good way of establishing a relationship between an employee and all of the projects that they work on. First, if we have a new employee, then they will not have any projects, so we will be wasting all of those fields, second if an employee has been here a long time then they might have worked on 300 projects, so we would have to include 300 project fields. Those people that are new and only have 1 project will have 299 wasted project fields. This design is also flawed because I will have to search in each of the project fields to find all of the people that have worked on a certain project, because that project number could be in any of the project fields.
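The search problem described above can be seen in a small Python/sqlite3 sketch that compares the repeated-field layout with a junction table. The wide table is cut down to three project fields, and all names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Repeated-field layout, cut down to three project columns.
CREATE TABLE tblEmployeesWide (
    EmployeeID INTEGER PRIMARY KEY, First TEXT,
    Project1 INTEGER, Project2 INTEGER, Project3 INTEGER
);
INSERT INTO tblEmployeesWide VALUES
    (1,'Ann',101,102,NULL),
    (2,'Bob',102,NULL,NULL),
    (3,'Cy',NULL,NULL,NULL);   -- new employee: every project field is wasted

-- Junction-table layout: one row per employee/project pair.
CREATE TABLE tblEmployees (EmployeeID INTEGER PRIMARY KEY, First TEXT);
CREATE TABLE tblEmployeeProjects (
    EmployeeID INTEGER REFERENCES tblEmployees(EmployeeID),
    ProjectNum INTEGER,
    PRIMARY KEY (EmployeeID, ProjectNum)
);
INSERT INTO tblEmployees VALUES (1,'Ann'),(2,'Bob'),(3,'Cy');
INSERT INTO tblEmployeeProjects VALUES (1,101),(1,102),(2,102);
""")

# Wide layout: every project field has to be checked, and the query grows with each new field.
wide = [r[0] for r in conn.execute("""
SELECT First FROM tblEmployeesWide
WHERE Project1 = 102 OR Project2 = 102 OR Project3 = 102
ORDER BY EmployeeID
""")]

# Junction layout: one condition, no matter how many projects an employee has.
junction = [r[0] for r in conn.execute("""
SELECT e.First
FROM tblEmployees AS e
JOIN tblEmployeeProjects AS ep ON ep.EmployeeID = e.EmployeeID
WHERE ep.ProjectNum = 102
ORDER BY e.EmployeeID
""")]
print(wide, junction)
```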
I covered a fair amount of the basic concepts. Let me know if you have other questions or need help with clarification or breaking it down in plain English. The wiki page did not read as plain English and might be daunting for some.
I've read the wiki links on normalization many times but I have found a better overview of normalization from this article. It is a simple easy to understand explanation of normalization up to fourth normal form. Give it a read!
Preview:
What is Normalization?
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
http://databases.about.com/od/specificproducts/a/normalization.htm
Database normalization is a formal process of designing your database to eliminate redundant data. The design consists of:
planning what information the database will store
outlining what information users will request from it
documenting the assumptions for review
Use a data-dictionary or some other metadata representation to verify the design.
The biggest problem with normalization is that you end up with multiple tables representing what is conceptually a single item, such as a user profile. Don't worry about normalizing data in tables that will have records inserted but not updated, such as history logs or financial transactions.
+1 for the analogy of talking to your wife. I find talking to anyone without a tech mind needs some ease into this type of conversation.
but...
To add to this conversation, there is the other side of the coin (which can be important when in an interview).
When normalizing, you have to watch how the databases are indexed and how the queries are written.
In a truly normalized database, I have found that in some situations it's easier to write queries that are slow because of bad join operations, bad indexing on the tables, and plain bad design of the tables themselves.
Bluntly, it's easier to write bad queries against highly normalized tables.
I think for every application there is a middle ground. At some point you want the ease of getting everything out of a few tables, without having to join a ton of tables to get one data set.