Joining fact tables in an MDX query - sql-server-2005

I am building an Analysis Services project using VS 2005. The goal is to analyse advertising campaigns.
I have a single cube with 2 fact tables
factCampaign: which contains details of what people interviewed thought of an advertising campaign
factDemographics: which contains demographic information of the people interviewed
These fact tables have a common dimension, dimRespondent, which refers to the actual person interviewed.
I have 2 other dimensions (I've left out the non-relevant ones):
dimQuestion: which contains the list of questions asked
dimAnswer: which contains the list of possible answers to each question
dimQuestion and dimAnswer are linked to factDemographics but not to factCampaign.
I want to be able to run queries that return results of what people thought about a campaign (from factCampaign) but filtered by demographic criteria (using dimQuestion and dimAnswer).
For example: how many males aged 18-25 recalled a particular campaign?
I am new to OLAP and Analysis Services (2005) so please excuse me if what I am asking is too basic.
I have tried the following options:
Linking the two fact tables in the data source view using the common RespondentKey. Queries run and return results, but the same result is returned regardless of the demographic criteria chosen, i.e. the criteria are being ignored.
Creating a dimension from factDemographics. I have tried to connect dimAnswer to factCampaign in the Dimension Usage tab of the cube structure, but without success. Either the project just stalls when I try to deploy it, or I get the following error (note that AttributeHierarchyEnabled is set to true):
Errors in the metadata manager. The 'Answer Key' intermediate granularity attribute of the 'Fact Demographics' measure group dimension does not have an attribute hierarchy enabled.
I would appreciate any help that anyone can offer. Let me know if you require more info, and again, apologies if this is a basic question.

What you probably need is a many-to-many relationship. There is a whitepaper here which goes through a number of scenarios for m2m relationships, including one specifically around surveys and questionnaires.

For anyone who is interested, the solution was to alter dimRespondent to include the questions and answers, and, in the Dimension Usage tab of the cube design, to set dimRespondent to have a Regular relationship to both fact tables.
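In data source view terms, the change amounted to something like the named query below. This is only a rough sketch; the demographic question names and key columns are illustrative, not the project's actual ones:
[CODE]
-- Hypothetical named query for the widened dimRespondent: pivot the
-- demographic question/answer pairs into columns, so the dimension can
-- have a Regular relationship to both measure groups.
SELECT r.RespondentKey,
       MAX(CASE WHEN q.QuestionText = 'Gender'    THEN a.AnswerText END) AS Gender,
       MAX(CASE WHEN q.QuestionText = 'Age group' THEN a.AnswerText END) AS AgeGroup
FROM   dimRespondent    r
JOIN   factDemographics f ON f.RespondentKey = r.RespondentKey
JOIN   dimQuestion      q ON q.QuestionKey   = f.QuestionKey
JOIN   dimAnswer        a ON a.AnswerKey     = f.AnswerKey
GROUP BY r.RespondentKey;
[/CODE]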

Related

Data Warehouse Architecture Modeling

I'm trying to architect a data warehouse using the star schema model... any ideas would be appreciated.
Any idea what I should do to create a star schema? Some say that I should have a linking table connecting DimProjects to the fact tables. What about project hours? What is the right approach here, or do I need other linking tables? Employees can work on multiple projects, and projects require man-hours... etc.
What is the best approach to modeling?
So far I have tables:
[CODE]
Dimension Tables    Measure Tables
----------------    --------------
DimEmployee         FactCRM
DimProjects         FactTargets
DimSalesDetails     FactRevenue
DimAccounts
DimTerritories
DimDate
DimTime
[/CODE]
Dimensions in a data warehouse schema are independent entities, for example:
Dim_Employee
EmpId (PK)
Name
Address
...and likewise for all the other dimensions, with each dimension's key linked to your fact table. In the above case, FactCRM would include only CRM-related measures and would be linked to its specific dimensions depending on the requirements.
Without knowing the columns, no one can tell you what you actually need. Also remember that linking a dimension to a fact is itself a partial star schema, so that doesn't lead to any issues. The only thing is, if your dimensions are themselves normalized into a schema, then it becomes a snowflake.
Another thing about facts: if you want to derive manipulations of other facts based on some existing facts, then you have to link the fact tables as well with a unique fact id. This is called a fact constellation. The schema then becomes a star/snowflake schema with a fact constellation.
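To make that concrete, here is a rough sketch of what a fact table for project hours could look like. All names are illustrative, not taken from the question; the employee-to-project many-to-many resolves itself at the fact grain:
[CODE]
-- Hypothetical star-schema fact: one row per employee, project and day.
-- The foreign key columns are the links to the dimension tables.
CREATE TABLE FactProjectHours (
    EmployeeKey INT           NOT NULL REFERENCES DimEmployee (EmployeeKey),
    ProjectKey  INT           NOT NULL REFERENCES DimProjects (ProjectKey),
    DateKey     INT           NOT NULL REFERENCES DimDate (DateKey),
    HoursWorked DECIMAL(5, 2) NOT NULL
);
[/CODE]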

Is there an appropriate way to show correlation or causation from a SQL query?

I'm curious if there's any way that Microsoft Access or SQL Server can provide a function to show correlation or causation from a SQL Select statement and its results.
This is basically the use case scenario:
Let's say you have two tables, studentCourseSurveys and onlineCourseReviews. On the surface these two tables are mostly unrelated, but they could be joined on the course name, for example "ENG 101".
studentCourseSurveys is a table that is intended to hold data that students submit during their in-person course survey at the end of a semester. For example, on the last day of class students receive a form to fill out to rate the instructor on things such as "exams were related to the actual lecture content" and "instructor was on time and prepared", and at the end they have a short-answer section to give additional comments.
onlineCourseReviews is a table that is used by an internal department that conducts content reviews on the online component of courses. For example, this department has individual instructional designers who are assigned different courses in Blackboard to review. They review content, delivery, course structure, and so on and so forth. The course is then given its comments, score, etc.
As already mentioned, the tables are mostly unrelated. But let's say someone wanted to show a correlation: that the results of the online course reviews somehow indicate that the overall quality of the course was better because of those reviews, and that this shows up in the students' survey responses (basically, a course that got an online course review score of 10 has course surveys where a great majority of students rated it as excellent, the content as relevant, the teacher as prepared, etc., indicating that an improvement in the online course translated to improvements in the in-person class and overall class quality).
This almost seems like a job for a statistician, but I'd like to know if it's possible to show this from a query in Access or SQL Server. I know you could easily join the two tables with a foreign key and get the results of a survey and an online course review in a single statement, but that doesn't really say anything. I would think that to show ANY kind of relationship you need to illustrate a trend over a given period of time.
Thank you.
I would suggest creating a query with three values: course name (e.g., ENG101), the student rating, and the online course rating. In SQL Server you can save the results as a .csv file. Do this, then open it in Excel and use the CORREL function to find the Pearson correlation coefficient between columns two and three. The closer the coefficient is to 1, the stronger the positive relationship; 0 means no relationship at all, and -1 means a perfect inverse relationship. (Excel's RSQ function returns the square of this value, R-squared, which is always between 0 and 1.)
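If you would rather skip the Excel step, the same coefficient can be computed directly in T-SQL. This is only a sketch, and the table and column names are assumptions:
[CODE]
-- Pearson r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2)),
-- computed over one row per course (assumes non-zero variance in both ratings).
SELECT (COUNT(*) * SUM(c.StudentRating * c.OnlineRating)
          - SUM(c.StudentRating) * SUM(c.OnlineRating))
     / (SQRT(COUNT(*) * SUM(c.StudentRating * c.StudentRating)
          - SUM(c.StudentRating) * SUM(c.StudentRating))
      * SQRT(COUNT(*) * SUM(c.OnlineRating * c.OnlineRating)
          - SUM(c.OnlineRating) * SUM(c.OnlineRating))) AS PearsonR
FROM (SELECT s.CourseName,
             AVG(CAST(s.Rating AS FLOAT)) AS StudentRating,
             AVG(CAST(o.Score  AS FLOAT)) AS OnlineRating
      FROM   studentCourseSurveys s
      JOIN   onlineCourseReviews  o ON o.CourseName = s.CourseName
      GROUP  BY s.CourseName) AS c;
[/CODE]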

sql, define Separate relation to target table or get by joins

We're working on a CMS project with EF and MVC. We recently encountered a problem.
Please consider these tables:
Applications
Entities
ProductsCategories
Products
Relations are in this order:
Applications=>Entities=>ProductCategories=>Products
When we select a product by its id, we always have to check whether the requested product id belongs to the specific application stored in the Applications table. This is to prevent loading other applications' products.
What is the best way to get a product for a specific application id? We have two choices:
Instead of defining a relation between Products and Applications, we can join through ProductsCategories, Entities and Applications to find it
=> but when we want to get products, we don't want to have to know about Entities or the other tables we must join just to reach Applications.
We can define a separate relation between Products and Applications and get it with a simple select query.
Which of these is the best way, and why?
Manish, first, thanks for your comment. Please consider that some of our tables have no relation with Entities; for those tables we would have to define a relation with Entities to access Applications, or define a separate relation as mentioned above. For those tables we simply define one relation and there is no extra work, apart from the performance question. Other tables do have relations with Entities, so for them defining a separate relation is extra work.
Finally, please consider that in fact all of the tables need to reach Entities, some through a separate relation and others through their parents.
Actually, for the relation between Products and Entities we didn't define a separate relation, because there is no performance issue there. But for the relation between Products and Applications we do have to consider performance, because on every request we have to access Applications to check that the requested id belongs to the current application.
So what is your idea?
Let's look at your options
Instead of defining a relationship, you can join the three tables to get the correct set of products: in this case, you won't have to make any database changes, and in any case you won't be fetching all the joined tables' data; you fetch only the data you specify in your LINQ select list. But a 3-table join can degrade performance a little once the number of rows gets very high at some point in time.
You can define a separate relationship between the two said tables: in this case you would have to change your database structure, which would mean making changes to your entity and entity model, and a lot of testing. No doubt, it will mean simpler code and ease of use, which is always welcome.
So you see, there is no clear answer; ultimately it depends on you and your code environment. As for me, I would go for creating a separate relationship between the Application and Product entities, because that gives cleaner code with a little less effort. Besides, as they say, "Code around your data structure, and not the other way around".
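In raw SQL terms the two options boil down to something like the following; the column names here are my assumptions:
[CODE]
-- Option 1: no extra relation; walk the chain up to the application.
SELECT p.*
FROM   Products           p
JOIN   ProductsCategories pc ON pc.Id = p.ProductCategoryId
JOIN   Entities           e  ON e.Id  = pc.EntityId
WHERE  p.Id = @ProductId
AND    e.ApplicationId = @ApplicationId;

-- Option 2: denormalized ApplicationId directly on Products.
SELECT p.*
FROM   Products p
WHERE  p.Id = @ProductId
AND    p.ApplicationId = @ApplicationId;
[/CODE]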

Survey Data Model - How to avoid EAV and excessive denormalization?

My database skills are mediocre at best and I have to design a data model for survey data. I have given this some thought and right now I feel that I am stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this and I hope that the wise folks on this forum can help me.
My question is: how should I model the answers to survey questions in an RDBMS? Using SQL Server is mandatory. So alternative data storage systems should be excluded from this discussion. (Sure, some should and will be evaluated, but not here please.) I don't need a solution for the entire data model, for now I'm only interested in the Answers part.
I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read it up.
Some assumptions about the data I have to deal with:
Each survey consists of 1 to n questionnaires
Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sound like a lot to answer...)
Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
The data for all countries has to be integrated later. This means for example, if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
I will have to write the integration and cleaning software, which will need to work with every country's data
In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.
I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.
The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis, vote for a denormalized structure with loads of tables and columns, as I described above (1 table per country and questionnaire). This sounds terrible to me, because I learned that wide tables are to be avoided, it will be annoying to determine which columns are actually in a table when working with it, the database will become cluttered with hundreds of tables (or I would even need to set up multiple databases, each with a similar yet slightly different design), etc.
O-O programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would need either a column of type sql_variant or multiple answer columns of different types to store answers of different types (multiple choice, free text, ...). The former would essentially be an EAV model. I tend to follow Joe Celko here, who strongly discourages its use (he calls it OTLT or "One True Lookup Table"). The latter would imply that each row contains null cells for the not-applicable types by design.
Another alternative I could think of would be to create one table per answer type, i.e. one for multiple-choice questions, one for free-text questions, etc. That's not so generic; I think it would lead to a lot of union joins, and I would have to add a table whenever a new answer type is invented.
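For illustration, reporting across those per-type tables would need something like this (all names invented):
[CODE]
-- One table per answer type means every cross-type report is a UNION.
SELECT RespondentId, QuestionId, CAST(ChoiceId AS VARCHAR(MAX)) AS Answer
FROM   AnswerMultipleChoice
UNION ALL
SELECT RespondentId, QuestionId, AnswerText
FROM   AnswerFreeText
UNION ALL
SELECT RespondentId, QuestionId, CAST(NumberValue AS VARCHAR(MAX))
FROM   AnswerNumeric;
[/CODE]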
Sorry for boring you with all this text and thank you for your input!
Cheers,
Alex
PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx
I think this is completely feasible within a relational model. I've built a CDM to show how I would do this.
Outbound
It takes four entities to define a country's survey: some parent Survey, the Country, and a list of Questions. Your questions have an internal relationship, so when one country "edits" a question, you can track both the question asked by the country and the question it came from. The fourth thing you need is a Possible Answer entity/table. Each question may have an associated list of possible answers (multiple choice, ranges, etc.). Those four should completely define the "OUTBOUND" side of this.
Inbound
The "INBOUND" side is just 2 new entities, The Respondent and the answer. The respondent is straightforward, just the demographics of that person if you know them and here you can include a relationship back to country. Each respondent answered the survey in a given country. (Person may be 1:n with Respondent if the person travels or has dual citizenship)
The answer is basic: either it is one of the choices listed in the list of Possible Answers, or it is provided free-form. Don't get all caught up in the fact that the answer may be a number, date, etc. just yet. Either it's an FK or a string of characters.
Reporting
A report is a join over all of these... You'll choose a country and a survey, get the list of questions and answers.
Answer Complexity
Depends on where you want to do your calculations. If you used a Varchar2(4000) column for your user-provided answers, you could add an attribute to question to describe the datatype of the answer. Q: Age? DT: Integer Between (0 and 130). Then your integration layer can do the validation instead of the database enforcing it. Or you can have 4 columns, one for number, date, character and CLOB. And your integration layer will determine the column to use. When you report those answers out, you'll just select all four columns with Coalesce().
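A sketch of that four-column variant, with invented names:
[CODE]
-- One nullable column per base type; exactly one is populated per row.
CREATE TABLE Answer (
    AnswerId     INT IDENTITY PRIMARY KEY,
    RespondentId INT NOT NULL,
    QuestionId   INT NOT NULL,
    NumberValue  DECIMAL(18, 4) NULL,
    DateValue    DATETIME       NULL,
    CharValue    VARCHAR(4000)  NULL,
    ClobValue    VARCHAR(MAX)   NULL
);

-- Reporting: collapse the typed columns back into one display value.
SELECT QuestionId,
       COALESCE(CONVERT(VARCHAR(MAX), NumberValue),
                CONVERT(VARCHAR(MAX), DateValue, 120),
                CharValue,
                ClobValue) AS AnswerText
FROM   Answer;
[/CODE]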
Is this an EAV, just because there's a slight ambiguity in the datatype of "Answer"?
No, it's not.
An EAV model breaks an entity down into a list of attributes, like so:
[CODE]
Entity  Attribute  Value
1       Fname      Stephanie
1       Lname      Page
1       Age        30
[/CODE]
Because you see the Answer column of the survey schema holding both words and numbers, like the Value column does here, you think that makes it an EAV. It does not. Just as adding 3 datatype columns to this model wouldn't change it from being an EAV.
I soooo hate it when
I've had people tell me that the query I'm tuning has to go "as fast as possible". OK, so give me a billion dollars and 30 years. "Wait, a billion what?" "As much as" and "as fast as" aren't requirements. You can validate anything you want in a database... build a shedload of BEFORE triggers, voila! Validation galore.
What's the datatype of an Age column? Or a Birthdate column? It depends on what your data source is. Some older records may only have month and year, or just a year, or 'around' or 'circa' some year. You couldn't have just a number column and do 'as much validation as possible', and NUMBER(2) may be BETTER validation than just NUMBER. So now you'll have NUMBER(1), NUMBER(2), NUMBER... to have "as much as".
Where I think you are getting tripped up
Think of this as a Conceptual Data Model, not a Physical one. In those terms, Survey is an entity. Is Question an entity, or just an attribute of Survey? If you built one table per survey, you're clearly saying that Question is just an attribute of Survey, and storing those vertically is what would make this an EAV. What this model shows is that Question is actually another entity. There is a relationship between questions, e.g. 'a country [can] edit questions': there was the original question and the edited one. Each question has a collection of possible answers. And the most important thing is that they are all questions. In an EAV I call fname, lname, bdate, age, major, salary, etc., all very disparate things, just attributes. In this case we're not including the name of the agency who originated the survey, the date it was issued, the date it is due back, and so on, as questions.
Let me put this another way. You're Fedex. You want to store timestamps for certain events. Each time a package enters or leaves a facility or vehicle. Time on the picking up truck, time off the truck and into the first facility, time out of that facility and onto a plane, etc. Do you store them Horizontally? How do you know the number of hops in advance? If you store them vertically does that automatically make it an EAV? And if so why.
You're a weather company getting temps from stations around the country. Let's say the sensors are designed to send a reading when the temperature changes +/- a full degree. If you store sensor_id | timestamp | temp in a Reading table, is that an EAV? Each reading isn't an attribute of the sensor; the readings are themselves entities which belong to a collection/series.
One thing that vertical storage of answers has in common with an EAV is the difficulty of analytic queries. Getting a list of all the people who answered TRUE to questions 5 and 10 but FALSE to 6 and 11 is awkward when done vertically. Maybe that's why you see this as an EAV. If you want to do a lot of that, you need a different storage layout; the relational storage of questions and answers isn't the best reporting database. Let's go back to the FedEx example: it's not simple to do "transit time" reporting when the rows are vertical.
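To be fair, it is doable with conditional aggregation. Here is a sketch against an Answer table with a character column; all names are assumptions:
[CODE]
-- Respondents who answered TRUE to questions 5 and 10
-- but FALSE to questions 6 and 11, over vertical storage.
SELECT RespondentId
FROM   Answer
WHERE  QuestionId IN (5, 6, 10, 11)
GROUP  BY RespondentId
HAVING SUM(CASE WHEN QuestionId IN (5, 10) AND CharValue = 'TRUE'  THEN 1 ELSE 0 END) = 2
   AND SUM(CASE WHEN QuestionId IN (6, 11) AND CharValue = 'FALSE' THEN 1 ELSE 0 END) = 2;
[/CODE]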
This sounds like you are wrestling with a common problem: how to use a hammer to fasten a screw.
Both alternatives you listed are bad, each for different reasons. But that's because you are trying to stuff your particular data model into a relational database system. A good approach would be to look beyond the relational database at some other database/storage systems, try a couple out, and find the best fit for your project.
I have tried the EAV model and gave up because it was far too complex, and I am afraid to try the multi-table model with a relational database system. The easiest solution I have found with a relational database is: store each complete response as a single CLOB, serialized into JSON or YAML (or something else lightweight), in a responses table.
create table responses (
    id uuid primary key,
    questionnaire_id uuid references questionnaires (id),
    data text
);
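A row would then look something like this (the ids and the JSON shape are made up):
[CODE]
-- Illustrative insert; keys inside the JSON document identify questions.
insert into responses (id, questionnaire_id, data)
values ('00000000-0000-0000-0000-000000000001',
        '00000000-0000-0000-0000-00000000000a',
        '{"q1": "yes", "q2": 42, "q17": "free text answer"}');
[/CODE]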
If I were using SQL Server (Express will be OK), then I would do this:
A table with the list of questions, with flags for type (bit) and whether the question is required (bit), the correct answer if one exists, etc.
A table with the list of countries.
A table linking countries and questions (some countries may not get some questions).
A table for answers, with columns for the required questions and an XML column for the optional questions, including those which are added.
If you are not versed in shredding XML, then use sparse columns for all the optional questions. I do not recall the exact limit on the number of sparse columns in a table, but I believe it is around 30,000. SQL Server internally stores sparse columns as XML and will shred the XML when you select a column, and yes, it can be indexed.
The diagram I created in SQL Server (not reproduced here) shows the idea: the column AL_A4 holds the answer to QL_Id = 4 and is of type sparse. The QL_Id row in the QuestionList table is not flagged as required, which tells you to make the corresponding column in AnswerList sparse.
Since countries will add questions, create QuestionListCustom, QuestiontoCountryCustom and AnswerListCustom tables and add the information from the custom questions there.
I am sure there are other ways to design the storage; this is the way I would turn in the homework. If this is not homework, then you surely work for the UN.
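For illustration, a trimmed-down sketch of the AnswerList idea; the names are mine except AL_A4 and QL_Id from the description above:
[CODE]
-- One sparse column per optional question; the column set exposes all
-- sparse columns as a single XML value for easy storage and shredding.
CREATE TABLE AnswerList (
    AnswerListId INT IDENTITY PRIMARY KEY,
    RespondentId INT NOT NULL,
    AL_A1        VARCHAR(100) NOT NULL,     -- a required question
    AL_A4        VARCHAR(100) SPARSE NULL,  -- optional question, QL_Id = 4
    AllAnswers   XML COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
[/CODE]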
Have you considered not reinventing the wheel? There are open source survey applications already built. Even if they don't meet your needs, download a few and check out their data models.

tools or a website to help with database normalization [closed]

Are there any tools or online resources (for example, tutorials) that would help a neophyte with database normalization?
An Introduction to Database Normalization
Wiki: Database normalization
Database Normalization Tips
Achieving a Well-Designed Database
In relational-database design theory, normalization rules identify certain attributes that must be present or absent in a well-designed database. There are a few rules that can help you achieve a sound database design:
A table should have an identifier. The fundamental rule of database design theory is that each table should have a unique row identifier, a column or set of columns used to distinguish any single record from every other record in the table. Each table should have an ID column, and no two records can share the same ID value. The column or columns serving as the unique row identifier for a table are the primary key of the table. In the AdventureWorks database, each table contains an identity column as the primary key column. For example, VendorID is the primary key for the Purchasing.Vendor table.
A table should store only data for a single type of entity. Trying to store too much information in a table can hinder the efficient and reliable management of the data in the table. In the AdventureWorks sample database, the sales order and customer information is stored in separate tables. Although you can have columns that contain information for both the sales order and the customer in a single table, this design leads to several problems. The customer information, name and address, must be added and stored redundantly for each sales order. This uses additional storage space in the database. If a customer address changes, the change must be made for each sales order. Also, if the last sales order for a customer is removed from the Sales.SalesOrderHeader table, the information for that customer is lost.
A table should [try to] avoid nullable columns. Tables can have columns defined to allow for null values. A null value indicates that there is no value. Although it can be useful to allow for null values in isolated cases, you should use them sparingly. This is because they require special handling that increases the complexity of data operations. If you have a table with several nullable columns and several of the rows have null values in the columns, you should consider putting these columns in another table linked to the primary table. By storing the data in two separate tables, the primary table can be simple in design and still handle the occasional need for storing this information.
A table should not have repeating values or columns. The table for an item in the database should not contain a list of values for a specific piece of information. For example, a product in the AdventureWorks database might be purchased from multiple vendors. If there is a column in the Production.Product table for the name of the vendor, this creates a problem. One solution is to store the names of all vendors in the column. However, this makes it difficult to show a list of the individual vendors. Another solution is to change the structure of the table to add another column for the name of the second vendor. However, this allows for only two vendors, and another column must be added if a product has three vendors. If you find that you have to store a list of values in a single column, or if you have multiple columns for a single piece of data, such as TelephoneNumber1 and TelephoneNumber2, you should consider putting the duplicated data in another table with a link back to the primary table. The AdventureWorks database has a Production.Product table for product information, a Purchasing.Vendor table for vendor information, and a third table, Purchasing.ProductVendor. This third table stores only the ID values for the products and the IDs of the vendors of the products. This design allows for any number of vendors for a product without modifying the definition of the tables, and without allocating unused storage space for products with a single vendor.
Ref.
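The last rule is the classic many-to-many pattern. A stripped-down sketch of the AdventureWorks example it describes (columns simplified):
[CODE]
-- Junction table: any number of vendors per product, no repeated columns.
CREATE TABLE Purchasing.ProductVendor (
    ProductID INT NOT NULL REFERENCES Production.Product (ProductID),
    VendorID  INT NOT NULL REFERENCES Purchasing.Vendor (VendorID),
    PRIMARY KEY (ProductID, VendorID)
);
[/CODE]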
The idea that you "shouldn't" use a tool is just thoughtless ideology. Mathematicians use calculators although they can do all of the calculations themselves. NORMA is good; I would start with that.
I use NORMA for conceptual database design. One side-effect is that it produces the schema for a properly normalized database.
1: Don't listen to anyone who tells you that normalization does not need a tool because anyone who makes such a comment does not understand the problem.
2: The "normal forms" are best seen as measures of the quality of a database design. For example, data schemas that are in higher normal forms have less data redundancy and are less susceptible to update anomalies - which means that your application needs less program code.
3: So, if normalization is essential, then what is the best way to do it?
There are many narratives about this so I will just mention two:
Method 1: Functional Decomposition (FD)
Now it is the case that at least one famous University Professor teaches the FD method. See this video: "Stanford University Video on functional decomposition".
Unfortunately (and sorry, Jennifer), functional decomposition is hugely complex, prone to error and, in my view, totally unworkable. (For example, how do you figure out the right "Mega" relation in the first place?)
Method 2: Use the NORMA tool to automatically generate 5th normal form schemas.
The NORMA tool is free and it runs on the free version of Visual Studio 2013.
You can learn more about it on my website.
Happy modeling.
Ken
PS I have been using the object-role modeling method for more than 20 years.
The best solution is to fully understand how to normalize; how long will you stay dependent on tools to do it for you? I would suggest you study it a bit so that you can come up with the best solution yourself. As a developer you will face this every now and then, and what about an interview, where, let's suppose, you are asked about it? And as Mitch Wheat said, normalization should not require a tool :)
Here are some more resources to get you started:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
Source: Mysql Website (Official)
As a beginner I used Relational Database Design.
Believe me, it's great! (It works and requires no prerequisites, i.e. it's ideal for beginners.) On page 4 it covers normalization.
The best tutorial I've ever seen is Logical Data Modeling by Art Langer.
Also, the accompanying pdf by Dr. Art Langer: http://dualibra.com/wp-content/uploads/2011/09/Analysis_and_Design_of_Information_Systems__Third_Edition.pdf
http://www.youtube.com/watch?v=IiVq8M5DBkk&list=PL196FE5448948D9B4