My database skills are mediocre at best, and I have to design a data model for survey data. I have given this some thought, and right now I feel stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this, and I hope the wise folks on this forum can help me.
My question is: how should I model the answers to survey questions in an RDBMS? Using SQL Server is mandatory, so alternative data storage systems should be excluded from this discussion. (Sure, some should and will be evaluated, but not here please.) I don't need a solution for the entire data model; for now I'm only interested in the Answers part.
I have already searched various forums, but I couldn't really find a solution. If one has already been given elsewhere, please excuse me and provide a link so I can read up on it.
Some assumptions about the data I have to deal with:
Each survey consists of 1 to n questionnaires
Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sounds like a lot to answer...)
Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
The data for all countries has to be integrated later. This means, for example, that if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
I will have to write the integration and cleaning software, which will need to work with every country's data
In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.
I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.
The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis, vote for a denormalized structure with loads of tables and columns, as described above (one table per country and questionnaire). This sounds terrible to me: I learned that wide tables are to be avoided, it will be annoying to determine which columns a table actually has when working with it, the database will become cluttered with hundreds of tables (or I would even need to set up multiple databases, each with a similar yet slightly different design), etc.
The O-O programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would either need a column of type sql_variant or multiple answer columns with different types to store answers of different kinds (multiple choice, free text, ...). The former would essentially be an EAV model. I tend to follow Joe Celko here, who strongly discourages that kind of design (he calls the related anti-pattern OTLT, or "One True Lookup Table"). The latter would imply that each row contains null cells for the non-applicable types by design.
Another alternative I could think of would be to create one table per answer type, i.e. one for multiple-choice questions, one for free-text questions, etc. That's not very generic, it would lead to a lot of unions in queries, I think, and I would have to add a table whenever a new answer type is invented.
Sorry for boring you with all this text and thank you for your input!
Cheers,
Alex
PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx
Imgur is down at the moment, so I'll post the picture later.
I think this is completely feasible within a relational model. I've built a conceptual data model (CDM) to show how I would do this.
Outbound
It takes four entities to define a country's survey: some parent Survey, the Country, and a list of Questions. The questions have an internal relationship, so when one country "edits" a question you can track both the question asked by the country and the question it came from. The other thing you need is a Possible Answer entity/table: each question may have an associated list of possible answers (multiple choice, ranges, etc.). Those four should completely define the "OUTBOUND" side of this.
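A rough DDL sketch of those four outbound entities (all names and column sizes here are mine, purely illustrative):
create table Survey (
    SurveyId  int primary key,
    Title     varchar(200) not null
);
create table Country (
    CountryId  int primary key,
    Name       varchar(100) not null
);
create table Question (
    QuestionId        int primary key,
    SurveyId          int not null references Survey (SurveyId),
    CountryId         int not null references Country (CountryId),
    ParentQuestionId  int null references Question (QuestionId),  -- the question this one was adapted from, if any
    QuestionText      varchar(2000) not null
);
create table PossibleAnswer (
    PossibleAnswerId  int primary key,
    QuestionId        int not null references Question (QuestionId),
    AnswerText        varchar(500) not null
);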
Inbound
The "INBOUND" side is just 2 new entities, The Respondent and the answer. The respondent is straightforward, just the demographics of that person if you know them and here you can include a relationship back to country. Each respondent answered the survey in a given country. (Person may be 1:n with Respondent if the person travels or has dual citizenship)
The answer is basic: either it is one of the choices listed in the question's Possible Answers, or it is free-form text provided by the respondent. Don't get caught up just yet in the fact that the answer may be a number, date, etc. Either it's a FK or a string of characters.
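Continuing the sketch with the same caveat (illustrative names only), the inbound side could be:
create table Respondent (
    RespondentId  int primary key,
    CountryId     int not null references Country (CountryId),
    Gender        char(1) null,   -- demographics, if known
    BirthYear     int null
);
create table Answer (
    AnswerId          int primary key,
    RespondentId      int not null references Respondent (RespondentId),
    QuestionId        int not null references Question (QuestionId),
    PossibleAnswerId  int null references PossibleAnswer (PossibleAnswerId),  -- one of the listed choices...
    ProvidedAnswer    varchar(4000) null                                      -- ...or a provided string of characters
);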
Reporting
A report is a join over all of these: you'll choose a country and a survey, and get the list of questions and answers.
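In terms of the sketch tables above, that report is just (again, illustrative; @SurveyId and @CountryId are parameter placeholders):
select r.RespondentId,
       q.QuestionText,
       pa.AnswerText    as ChosenAnswer,
       a.ProvidedAnswer
from Answer a
join Respondent r on r.RespondentId = a.RespondentId
join Question q   on q.QuestionId   = a.QuestionId
left join PossibleAnswer pa on pa.PossibleAnswerId = a.PossibleAnswerId
where q.SurveyId  = @SurveyId
  and r.CountryId = @CountryId;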
Answer Complexity
It depends on where you want to do your calculations. If you used a Varchar2(4000) column for user-provided answers, you could add an attribute to Question describing the datatype of the answer, e.g. Q: Age? DT: Integer between 0 and 130. Then your integration layer can do the validation instead of the database enforcing it. Or you can have four columns, one each for number, date, character and CLOB, and your integration layer will determine which column to use. When you report those answers out, you'll just select all four columns with Coalesce().
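For the four-column variant, the reporting select might look like this (a sketch; the typed column names are assumptions, not part of the model above, and the casts are needed because Coalesce wants compatible types):
select a.QuestionId,
       coalesce(cast(a.NumberValue as varchar(40)),
                cast(a.DateValue   as varchar(40)),
                a.CharValue,
                cast(a.ClobValue   as varchar(8000))) as AnswerValue
from Answer a;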
Is this an EAV, because there's a slight ambiguity to the datatype of "Answer"?
No, it's not.
An EAV model breaks an entity down into a list of attributes, like so:
Entity   Attribute   Value
1        Fname       Stephanie
1        Lname       Page
1        Age         30
Because the Answer column of the survey schema holds both words and numbers, like the Value column does here, you think that makes it an EAV. It does not, just as adding three datatype columns to this model would not change it from being an EAV.
I soooo hate it when
I've had people tell me that the query I'm tuning has to go "as fast as possible". OK, so give me a billion dollars and 30 years. "Wait, a billion what?" "As much as" and "as fast as" aren't requirements. You can validate anything you want in a database: build a shedload of BEFORE triggers, and voila, validation galore.
What's the datatype of an Age column? Or a Birthdate column? It depends on what your data source is. Some older records may only have month and year, or just a year, or 'around' or 'circa' some year. You couldn't have just a number column and claim to do 'as much validation as possible'. And NUMBER(2) may be BETTER validation than plain NUMBER, so now you'll need NUMBER(1), NUMBER(2), NUMBER... to have "as much as".
Where I think you are getting tripped up
Think of this as a conceptual data model, not a physical one. In those terms, Survey is an entity. Is Question an entity, or just an attribute of Survey? If you built one table per survey, you'd clearly be saying that Question is just an attribute of Survey, and then storing questions vertically would make this an EAV. What this model shows is that Question is actually another entity. There is a relationship between questions, e.g. 'a country [can] edit questions': there is the original question and the edited one. Each question has a collection of possible answers. And the most important thing is that they are all questions. In an EAV I call fname, lname, bdate, age, major, salary, etc., all very disparate things, just attributes. In this case we're not including the name of the agency that originated the survey, the date it was issued, the date it is due back, etc., as questions.
Let me put this another way. You're FedEx. You want to store timestamps for certain events: each time a package enters or leaves a facility or vehicle. Time onto the pickup truck, time off the truck and into the first facility, time out of that facility and onto a plane, etc. Do you store them horizontally? How do you know the number of hops in advance? If you store them vertically, does that automatically make it an EAV? And if so, why?
You're a weather company getting temps from stations around the country. Let's say the sensors are designed to send a reading when the temperature changes by +/- a full degree. If you store sensor_ID|timestamp|temp in a Readings table, is that an EAV? Each reading isn't an attribute of the sensor; the readings are themselves entities which belong to a collection/series.
One thing that vertical storage of answers has in common with an EAV is its difficulty with analytic queries. Getting a list of all the people who answered TRUE to questions 5 and 10 but FALSE to 6 and 11 would be very difficult when done vertically. Maybe that's why you see this as an EAV. If you want to do that, you need a different storage layout; the relational storage of the questions and answers isn't the best reporting database. Going back to the FedEx example: it's not simple to do "transit time" reporting when the rows are vertical.
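For instance, against the Answer table sketched earlier, that query becomes a relational-division exercise (hypothetical question IDs and answer values; assumes one answer per respondent per question):
select a.RespondentId
from Answer a
where (a.QuestionId in (5, 10) and a.ProvidedAnswer = 'TRUE')
   or (a.QuestionId in (6, 11) and a.ProvidedAnswer = 'FALSE')
group by a.RespondentId
having count(distinct a.QuestionId) = 4;  -- all four conditions must hold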
This sounds like you are wrestling with a common problem: how to use a hammer to fasten a screw.
Both alternatives you listed are bad, each for different reasons. But that's because you are trying to stuff your particular data model into a relational database system. A good approach would be to look beyond the relational database at some other database/storage systems, try a couple out, and find the best fit for your project.
I have tried the EAV model and gave up because it was far too complex, and I would be afraid to try the multi-table model in a relational database system. The easiest solution I have found with a relational database is to store each complete response as a single CLOB, serialized into JSON or YAML (or something else lightweight), in a responses table:
create table responses (
    id               uuid primary key,
    questionnaire_id uuid references questionnaires (id),
    data             text   -- one complete response, serialized as JSON or YAML
);
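Usage is then a single insert per response, e.g. (made-up UUIDs, and the JSON shape is whatever your integration code likes):
insert into responses (id, questionnaire_id, data)
values ('1b671a64-40d5-491e-99b0-da01ff1f3341',
        '9f2c3e44-7a1b-4c2d-8e5f-0a1b2c3d4e5f',
        '{"q1": "yes", "q2": 42, "q17": "free text answer"}');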
If I were using SQL Server (Express will be OK), then I would do this:
A table with the list of questions, with flags for type (bit) and required (bit), the correct answer if one exists, etc.
A table with the list of countries.
A table linking countries and questions (some countries may not get some questions).
A table for answers, with columns for the questions and an XML column for the optional questions, including those which are added.
If you are not versed in shredding XML, then use sparse columns for all the optional questions. I do not recall the exact limit on the number of sparse columns in a table, but I believe it is in the region of 30,000. SQL Server internally stores sparse columns as XML and shreds them when you select the column, and yes, sparse columns can be indexed.
The diagram below shows a design created with SQL Server. The column AL_A4 will hold the answer to QL_Id = 4 and is of type sparse. QL_Id 4 in the QuestionList table is not flagged as required, letting you know to make the corresponding column in AnswerList sparse.
Since countries will add questions, create QuestionListCustom, QuestiontoCountryCustom and AnswerListCustom tables and add the information from the custom questions there.
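To illustrate, the sparse declaration itself is just one extra keyword (a sketch mirroring the AL_A4 naming above; column sizes are guesses):
create table AnswerList (
    AL_Id  int identity primary key,
    AL_A1  varchar(400) not null,      -- answer to a required question, e.g. QL_Id = 1
    AL_A4  varchar(400) sparse null,   -- answer to optional question QL_Id = 4
    Extra  xml null                    -- country-added custom questions
);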
I am sure there are other ways to design the storage; this is the way I would turn in the homework. If this is not homework, then you surely work for the UN.
Have you considered not reinventing the wheel? There are open source survey applications already built. Even if they don't meet your needs, download a few and check out their data models.
Related
I'm not a DBA, so I'm not familiar with the proper lingo; maybe the title of the question is a little misleading.
So, the thing: I have Members in a certain system, and these members can be part of a demographic segment (any kind of segment: favorite color, gender, job, etc.)
These are the tables
SegmentCategory
ID, Name, Description
SegmentCategory_segment
SegmentID, SegmentCategoryID
Segment
ID, Name, Description
MemberSegment
ID, MemberID, SegmentID
So the guy who designed the DB decided to uber-normalize everything, and he put the member's gender in a segment rather than in the Member's table.
Is this OK? According to my logic, gender is a property of the Member, so it should be on that entity. But then there would be duplicated data (gender on the Member and gender as a segment), though a trigger on the Member table could fix this (update the segment on a gender change).
Having to crawl four tables just to get a property of the member seems like over-engineering to me.
My question is whether I'm right or not. If so, how could I propose the change to the DBA?
There isn't a blanket rule you can apply to database decisions like this. It depends on what applications and processes the database is supporting. A database used for reporting is much easier to work with when it is more denormalized (in a well-thought-out way) than a more transactional database is.
You can have a customer record spread across two tables, for instance, if some data is accessed or updated more often than other parts. Say you only need one half of the data in 90% of your queries, but don't want to drag around the varchar(max) fields you have there for whatever reason.
Having said that, having a table with just a gender/member ID is on the far side of extreme. From my naive understanding of your situation, I feel you just need a members table with views over top for your segments.
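For example (a sketch, assuming gender lives on the Member table):
create view FemaleMembers as
select MemberID
from Member
where Gender = 'F';
-- Reads like a segment, but the fact is stored exactly once, on Member.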
As for the DBA: ultimately, I imagine, they will be the one who has to maintain the integrity of the data, so I would just approach them and say "hey, what do you think of this?" Hopefully they'll either see the merit or be able to give you the reasons for their design decisions.
I am trying to upgrade an existing database for an online job-finding and hiring website. The main goal is to make the people table more browsable by adding categories, features, a better tag system, and subcategories.
Here is the problem: each category has its own subcategories and features. For example, when a user is browsing people in the teaching category, the user might want to find out whether they teach privately (home-school), so I add a bit-type column for that feature. But, as you might know, not all categories need a home-schooling feature; other categories like computing, engineering, medicine and so on don't. That means all the rows with a category other than teaching will have a useless NULL in that column. Each one takes a byte, which might not sound like a lot, but in the end I might have tons of such useless NULLs in each row, wasting space.
I also can't create a table per category, since the people table has relations with other tables like users, comments, images, etc.
What do you suggest I do?
One option is to make the column's data type nvarchar. The benefit is that you can declare a large maximum size, but only the size of the value actually stored is used. For more detail, refer to this link:
http://www.c-sharpcorner.com/UploadFile/cda5ba/difference-between-char-nchar-varchar-and-nvarchar-data-ty/
15 ECTS credits' worth of database design down the bin... I really can't come up with the best design solution for my problem.
Which is this: basically, I'm making a tool that gathers a lot of information about the user. At most, the user would fill in 50 fields of data, ranging from simple checkboxes to text input. I'm designing the DB right now (with MySQL) and can't decide whether to use a single User table with all of those fields, or to have a table for each category of input.
One example would be "type of payment". This one has three options and if I went with the "table" way I would add a table paymentType and give it binary fields for each payment type. Then I would need and id table to identify which paymentType the user has chosen whereas if I use a single user table, the data would already be there.
The site will probably see a lot of users (tv, internet and radio marketing) so I'm concerned which alternative would be the best.
I'll be happy to provide more details if you need more to base a decision.
Thanks for reading.
Read this article "Database Normalization Basics", and come back here if you still have questions. It should help a lot.
The most fundamental idea behind these decisions, as you will see in this article, is that each table should represent one and only one "thing", and each field should relate directly and only to that thing.
In your payment types example, it probably makes sense to break it out into a separate table if you anticipate the need to store additional information about each payment type.
Create your "Type of Payment" table; there's no real question there. That's proper normalization and the power behind using relational databases. One of the many reasons to do so is the ability to update a Type of Payment record and not have to touch the related data in your users table. Your join between the two tables will allow your app to see the updated type of payment info by changing it in just the 1 place.
Regarding your other fields, they may not be as clear-cut. The question to ask yourself about each field is: "does this field relate only to a user, or does it have meaning and possible use in its own right?" If you can never imagine a field having meaning outside the context of a user, you're safe leaving it as a field on the user table; otherwise, set up a primary key-foreign key relationship and put the information in its own table.
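For the payment-type case, that boils down to something like this (a sketch with assumed names and example values):
create table PaymentType (
    PaymentTypeId  int primary key,
    Name           varchar(50) not null   -- e.g. 'Visa', 'Invoice', 'PayPal'
);
create table Users (
    UserId         int primary key,
    PaymentTypeId  int not null,
    foreign key (PaymentTypeId) references PaymentType (PaymentTypeId)
    -- ... the rest of the user fields ...
);
-- Renaming a payment type touches exactly one row, not every user:
update PaymentType set Name = 'Visa Debit' where PaymentTypeId = 1;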
If you are building a form with variable inputs, I wouldn't recommend building it as one table. This is inflexible and dirty.
Normalization is the key. However, if you end up with a key/value setup, or effectively a scalar-type implementation across many tables, and you can't cache:
a) the form definition built from table data, and
b) the joined result of storage (either a caching view or otherwise),
c) or don't build in proper sharding,
then you may hit a performance boundary.
In this KVP setup, you might want to look at something like CouchDB or a less table-driven storage format.
You may also want to look at trickier setups such as serialized object storage and cache tables, if your internal data is heavily interdependent with other data already in the database.
50 columns is a lot. Have you considered a table that stores values like a property sheet? This would only be useful if you didn't need to regularly query the values it contains.
INSERT INTO UserProperty(UserID, Name, Value)
VALUES(1, 'PaymentType', 'Visa')
INSERT INTO UserProperty(UserID, Name, Value)
VALUES(1, 'TrafficSource', 'TV')
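For completeness, the backing table for those inserts might be (a sketch; column sizes are guesses):
create table UserProperty (
    UserID  int not null,
    Name    varchar(100) not null,
    Value   varchar(400) null,
    primary key (UserID, Name)
);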
I think I figured out a great way of solving this. Thanks to a friend of mine for suggesting this!
I have three tables: Field {IdField, FieldName, FieldType}, FieldInput {IdInput, IdField, IdUser} and User {IdUser, UserName, ... etc.}
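In SQL that would look roughly like this (a sketch; note that FieldInput presumably also needs a column for the value the user entered, which the list above omits, so I've added a hypothetical Value column):
create table Field (
    IdField    int primary key,
    FieldName  varchar(100) not null,
    FieldType  varchar(20) not null   -- e.g. 'checkbox', 'text'
);
create table User (
    IdUser    int primary key,
    UserName  varchar(100) not null
    -- ... etc ...
);
create table FieldInput (
    IdInput  int primary key,
    IdField  int not null,
    IdUser   int not null,
    Value    varchar(400) null,       -- hypothetical: the entered value itself
    foreign key (IdField) references Field (IdField),
    foreign key (IdUser)  references User (IdUser)
);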
This way it becomes very easy to see what a user has answered, the solution is somewhat scalable, and it provides a good overview. I will constrain the alternatives in another layer, farther away from the DB. I believe it's a tradeoff worth making.
Any suggestions or criticisms of this solution?
Are there any tools or online resources (for example, tutorials) that would help a neophyte with database normalization?
An Introduction to Database Normalization
Wiki: Database normalization
Database Normalization Tips
Achieving a Well-Designed Database
In relational-database design theory, normalization rules identify certain attributes that must be present or absent in a well-designed database. There are a few rules that can help you achieve a sound database design:
A table should have an identifier. The fundamental rule of database design theory is that each table should have a unique row identifier, a column or set of columns used to distinguish any single record from every other record in the table. Each table should have an ID column, and no two records can share the same ID value. The column or columns serving as the unique row identifier for a table are the primary key of the table. In the AdventureWorks database, each table contains an identity column as the primary key column. For example, VendorID is the primary key for the Purchasing.Vendor table.
A table should store only data for a single type of entity. Trying to store too much information in a table can hinder the efficient and reliable management of the data in the table. In the AdventureWorks sample database, the sales order and customer information is stored in separate tables. Although you can have columns that contain information for both the sales order and the customer in a single table, this design leads to several problems. The customer information, name and address, must be added and stored redundantly for each sales order. This uses additional storage space in the database. If a customer address changes, the change must be made for each sales order. Also, if the last sales order for a customer is removed from the Sales.SalesOrderHeader table, the information for that customer is lost.
A table should [try to] avoid nullable columns. Tables can have columns defined to allow for null values. A null value indicates that there is no value. Although it can be useful to allow for null values in isolated cases, you should use them sparingly. This is because they require special handling that increases the complexity of data operations. If you have a table with several nullable columns and several of the rows have null values in the columns, you should consider putting these columns in another table linked to the primary table. By storing the data in two separate tables, the primary table can be simple in design and still handle the occasional need for storing this information.
A table should not have repeating values or columns. The table for an item in the database should not contain a list of values for a specific piece of information. For example, a product in the AdventureWorks database might be purchased from multiple vendors. If there is a column in the Production.Product table for the name of the vendor, this creates a problem. One solution is to store the name of all vendors in the column. However, this makes it difficult to show a list of the individual vendors. Another solution is to change the structure of the table to add another column for the name of the second vendor. However, this allows for only two vendors, and another column must be added if a product has three vendors. If you find that you have to store a list of values in a single column, or if you have multiple columns for a single piece of data, such as TelephoneNumber1 and TelephoneNumber2, you should consider putting the duplicated data in another table with a link back to the primary table. The AdventureWorks database has a Production.Product table for product information, a Purchasing.Vendor table for vendor information, and a third table, Purchasing.ProductVendor. This third table stores only the ID values for the products and the IDs of the vendors of the products. This design allows for any number of vendors for a product without modifying the definition of the tables, and without allocating unused storage space for products with a single vendor.
Ref.
The idea that you "shouldn't" use a tool is just thoughtless ideology. Mathematicians use calculators even though they can do all of the calculations themselves. NORMA is good; I would start with that.
I use NORMA for conceptual database design. One side-effect is that it produces the schema for a properly normalized database.
1: Don't listen to anyone who tells you that normalization does not need a tool because anyone who makes such a comment does not understand the problem.
2: The "normal forms" are best seen as measures of the quality of a database design. For example, data schemas that are in higher normal forms have less data redundancy and are less susceptible to update anomalies - which means that your application needs less program code.
3: So, if normalization is essential, then what is the best way to do it?
There are many narratives about this so I will just mention two:
Method 1: Functional Decomposition (FD)
Now it is the case that at least one famous University Professor teaches the FD method. See this video: "Stanford University Video on functional decomposition".
Unfortunately (and sorry, Jennifer), functional decomposition is hugely complex, prone to error and, in my view, totally unworkable. (E.g., how do you figure out the right "mega" relation in the first place?)
Method 2: Use the NORMA tool to automatically generate 5th normal form schemas.
The NORMA tool is free and it runs on the free version of Visual Studio 2013.
You can learn more about it on my website.
Happy modeling.
Ken
PS I have been using the object-role modeling method for more than 20 years.
The best solution is to fully understand how to normalize yourself; otherwise, how long will you be dependent on tools to do it for you? I would suggest you study it a bit so that you can come up with the best solution on your own. As a developer you will face this every now and then, and what about an interview where, let's suppose, you are asked about it? And, as Mitch Wheat said, normalization should not require a tool :)
Here are some more resources to get you started:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
Source: MySQL website (official)
As a beginner I used Relational Database Design.
Believe me, it's great! (It works and requires no prerequisites, i.e. it's ideal for beginners.) On page 4 it covers normalization.
The best tutorial I've ever seen is Logical Data Modeling by Art Langer.
Also, the accompanying pdf by Dr. Art Langer: http://dualibra.com/wp-content/uploads/2011/09/Analysis_and_Design_of_Information_Systems__Third_Edition.pdf
http://www.youtube.com/watch?v=IiVq8M5DBkk&list=PL196FE5448948D9B4
I have a table in a DB (Postgres-based) which acts like a superclass in object-oriented programming. It has a 'type' column which determines which additional columns should be present in the table (the subclass properties). But I don't want the table to include all possible columns (all properties of all possible types).
So I decided to make a table containing 'key' and 'value' columns (e.g. 'filename' = '/file', or 'some_value' = '5') to hold any property of the object not included in the superclass table, plus one related table to contain the available 'key' values.
But there is a problem with such an architecture: the 'value' column has to be of a string data type by default to be able to contain anything, and I don't think converting to and from strings is a good decision. What is the best way to get around this limitation?
The design you're experimenting with is a variation of Entity-Attribute-Value, and it comes with a whole lot of problems and inefficiencies. It's not a good solution for what you're doing, except as a last resort.
What could be a better solution is what fallen888 describes: create a "subtype" table for each of your subtypes. This is okay if you have a finite number of subtypes, which sounds like what you have. Then your subtype-specific attributes can have data types, and also a NOT NULL constraint if appropriate, which is impossible if you use the EAV design.
One remaining weakness of the subtype-table design is that you can't enforce that a row exists in the subtype table just because the main row in the superclass table says it should. But that's a milder weakness than those introduced by the EAV design.
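A sketch of that subtype-table shape, with illustrative names drawn from your example properties:
-- supertype: one row per object
create table entity (
    entity_id  integer primary key,
    type       varchar(20) not null
);
-- one table per subtype; each attribute gets a real type and real constraints
create table entity_file (
    entity_id  integer primary key references entity (entity_id),
    filename   text not null
);
create table entity_counter (
    entity_id   integer primary key references entity (entity_id),
    some_value  integer not null   -- no to-and-from-string conversion
);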
Edit: regarding your additional information about comments-on-any-entity: yes, this is a pretty common pattern. Beware of a broken solution called "polymorphic association", a technique many people use in this situation.
How about this instead: each sub-type gets its own DB table, and the base/super table just has a varchar column that holds the name of the sub-type table. Then you can have something like this...
Entity
------
ID
Name
Type
SubTypeName (value of this column will be 'Dog')
Dog
---
VetName
VetNumber
etc
If you don't want your (sub-)table names to be varchar values in the base table, you can also just have a SubType table whose primary key will be in the base table.
The only workaround (while retaining your structure) is to have separate tables:
create table IntProps(...);
create table StringProps(...);
create table CurrencyProps(...);
But I do not think that this is a good idea...
One common approach is to have the key-value table contain multiple value columns, one for each data type, i.e. StringValue, DecimalValue, etc.
Just know you're trading queryability and performance for a database schema you don't need to change. You could also consider ORM mapping or an object database.
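That is, something like this sketch, where at most one typed column is non-null per row (a CHECK constraint can enforce that):
create table object_property (
    object_id     integer not null,
    key_id        integer not null,
    string_value  text,
    int_value     integer,
    date_value    date,
    primary key (object_id, key_id),
    check (
        (string_value is not null)::int
      + (int_value    is not null)::int
      + (date_value   is not null)::int <= 1
    )
);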
You could have a key/value table per type. The 'available keys' table would then need to encode which typed key/value table a specific key/type pair points to.
This seems like a highly inefficient architecture for a row-based relational database, however.
Perhaps you should take a look at a column oriented relational database?
Thanks for the answers. I'll explain a little more specifically what I need.
There's a need to program a blog+forum website, and I've been looking at the WordPress DB structure.
There's a strong need for the ability to attach comments to any kind of 'object', like a blog entry or a video file attached to one. The above DB structure was chosen because it is very easy to scale and fulfills all our needs.
But it's not too late to change it, because this is at an early engineering stage. Also, our model now smells like a completely tree-hierarchy-based DB. For now I'll accept Bill Karwin's and fallen888's answers, but maybe I'm going in a totally wrong direction?
About the user being able to add a new field to the table:
I admire all these people making comments.
I used to be interested in this kind of thing a few years ago, but have written little code recently (apart from a little PHP and MySQL).
I think it's fine if you want to keep going - you may end up with something new.
Sorry to pour any cold water on the scheme; I admire your efforts. My personal belief is that if you go far enough in this direction, you will end up with a system that interprets more of natural language than SQL does. (Around 1970, SQL was actually spelt Sequel, and it stood for "Structured English Query Language", but after it was standardized - I think someone said that Oracle was the first commercial implementation, in 1979 - the "English" got dropped, I guess because they decided that it was only a tiny subset of English.)
I have run out of steam in this area, because I haven't got a job. Without an easy job that pays the bills, where I can experiment with these ideas, it's a bit hard to concentrate on this area.
Best wishes to all.
Incidentally, Oracle got their first contract writing a database for the CIA.