Database design for a survey [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 5 years ago.
I need to create a survey where answers are stored in a database. I'm just wondering what would be the best way to implement this in the database, specifically the tables required. The survey contains different types of questions. For example: text fields for comments, multiple choice questions, and possibly questions that could contain more than one answer (i.e. check all that apply).
I've come up with two possible solutions:
1. Create a giant table which contains the answers for each survey submission. Each column would correspond to an answer from the survey, i.e. SurveyID, Answer1, Answer2, Answer3. I don't think this is the best way, since there are a lot of questions in this survey, and it doesn't seem very flexible if the survey is to change.
2. Create a Question table and an Answer table. The Question table would contain all the questions for the survey. The Answer table would contain individual answers from the survey, each row linked to a question.
A simple example:
tblSurvey: SurveyID
tblQuestion: QuestionID, SurveyID, QuestionType, Question
tblAnswer: AnswerID, UserID, QuestionID, Answer
tblUser: UserID, UserName
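For concreteness, option #2 can be sketched directly from the column lists above; the column types and sample data here are assumptions, shown with SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblSurvey   (SurveyID   INTEGER PRIMARY KEY);
CREATE TABLE tblUser     (UserID     INTEGER PRIMARY KEY, UserName TEXT);
CREATE TABLE tblQuestion (QuestionID INTEGER PRIMARY KEY,
                          SurveyID   INTEGER REFERENCES tblSurvey(SurveyID),
                          QuestionType TEXT,   -- e.g. 'text', 'choice', 'multi'
                          Question   TEXT);
CREATE TABLE tblAnswer   (AnswerID   INTEGER PRIMARY KEY,
                          UserID     INTEGER REFERENCES tblUser(UserID),
                          QuestionID INTEGER REFERENCES tblQuestion(QuestionID),
                          Answer     TEXT);
""")

# A "check all that apply" question simply gets one tblAnswer row per box ticked.
conn.execute("INSERT INTO tblSurvey VALUES (1)")
conn.execute("INSERT INTO tblUser VALUES (1, 'alice')")
conn.execute("INSERT INTO tblQuestion VALUES (1, 1, 'multi', 'Which apply?')")
conn.executemany("INSERT INTO tblAnswer VALUES (?, 1, 1, ?)",
                 [(1, 'A'), (2, 'C')])
rows = conn.execute(
    "SELECT Answer FROM tblAnswer WHERE QuestionID = 1 ORDER BY AnswerID"
).fetchall()
```

Note that multi-answer questions need no special structure here: they are just multiple rows per (user, question).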
My problem with this is that there could be tons of answers, which would make the Answer table pretty huge. I'm not sure that's so great when it comes to performance.
I'd appreciate any ideas and suggestions.

I think that your model #2 is fine; however, you can take a look at a more complex model, which stores questions and pre-made (offered) answers and allows them to be re-used in different surveys.
- One survey can have many questions; one question can be (re)used in many surveys.
- One (pre-made) answer can be offered for many questions, and one question can have many answers offered. A question can have different answers offered in different surveys, and an answer can be offered to different questions in different surveys. There is a default "Other" answer; if a person chooses "Other", their answer is recorded in Answer.OtherText.
- One person can participate in many surveys, and one person can answer a specific question in a survey only once.
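A minimal sketch of those rules (all table and column names here are illustrative, not taken from the linked script; shown with SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Survey   (SurveyID   INTEGER PRIMARY KEY, Title TEXT);
CREATE TABLE Question (QuestionID INTEGER PRIMARY KEY, Text TEXT);
CREATE TABLE Answer   (AnswerID   INTEGER PRIMARY KEY, Text TEXT);
CREATE TABLE Person   (PersonID   INTEGER PRIMARY KEY, Name TEXT);

-- Which (question, answer) pairs are offered in which survey:
-- questions and pre-made answers are re-usable across surveys.
CREATE TABLE SurveyQuestionAnswer (
    SurveyID   INTEGER REFERENCES Survey,
    QuestionID INTEGER REFERENCES Question,
    AnswerID   INTEGER REFERENCES Answer,
    PRIMARY KEY (SurveyID, QuestionID, AnswerID)
);

-- One person answers a given question in a given survey only once:
-- the primary key enforces that rule.
CREATE TABLE Response (
    PersonID   INTEGER REFERENCES Person,
    SurveyID   INTEGER,
    QuestionID INTEGER,
    AnswerID   INTEGER,
    OtherText  TEXT,   -- filled in when the "Other" answer is chosen
    PRIMARY KEY (PersonID, SurveyID, QuestionID)
);
""")

conn.execute("INSERT INTO Response VALUES (1, 1, 1, 1, NULL)")
try:
    # Same person, same survey, same question: must be rejected.
    conn.execute("INSERT INTO Response VALUES (1, 1, 1, 2, NULL)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```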

My design is shown below.
The latest create script is at https://gist.github.com/durrantm/1e618164fd4acf91e372
The script and the mysql workbench.mwb file are also available at
https://github.com/durrantm/survey

Definitely option #2. Also, I think you might have an oversight in the current schema; you might want another table:
+-----------+
| tblSurvey |
|-----------|
| SurveyId |
+-----------+
+--------------+
| tblQuestion |
|--------------|
| QuestionID |
| SurveyID |
| QuestionType |
| Question |
+--------------+
+--------------+
| tblAnswer |
|--------------|
| AnswerID |
| QuestionID |
| Answer |
+--------------+
+------------------+
| tblUsersAnswer |
|------------------|
| UserAnswerID |
| AnswerID |
| UserID |
| Response |
+------------------+
+-----------+
| tblUser |
|-----------|
| UserID |
| UserName |
+-----------+
Each question will probably have a set number of answers which the user can select from; the actual responses are then tracked in another table.
Databases are designed to store a lot of data, and most scale very well. There is no real need to use a lesser normal form simply to save space anymore.
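To illustrate how the extra tblUsersAnswer table separates offered answers from actual responses, here is a small sketch (types and data are assumptions, using SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblAnswer (AnswerID INTEGER PRIMARY KEY,
                        QuestionID INTEGER, Answer TEXT);
CREATE TABLE tblUsersAnswer (UserAnswerID INTEGER PRIMARY KEY,
                             AnswerID INTEGER REFERENCES tblAnswer,
                             UserID INTEGER, Response TEXT);
""")
# tblAnswer holds the options offered; tblUsersAnswer holds what users picked.
conn.executemany("INSERT INTO tblAnswer VALUES (?, 1, ?)",
                 [(1, 'Yes'), (2, 'No')])
conn.executemany(
    "INSERT INTO tblUsersAnswer (AnswerID, UserID, Response) VALUES (?, ?, NULL)",
    [(1, 10), (1, 11), (2, 12)])

# Tally how many users picked each offered answer for question 1.
tally = conn.execute("""
    SELECT a.Answer, COUNT(ua.UserAnswerID)
    FROM tblAnswer a
    LEFT JOIN tblUsersAnswer ua ON ua.AnswerID = a.AnswerID
    WHERE a.QuestionID = 1
    GROUP BY a.AnswerID
    ORDER BY a.AnswerID
""").fetchall()
```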

As a general rule, modifying the schema based on something that a user could change (such as adding a question to a survey) should be considered fairly smelly. There are cases where it can be appropriate, particularly when dealing with large amounts of data, but know what you're getting into before you dive in. Having just a "responses" table for each survey means that adding or removing questions is potentially very costly, and it's very difficult to do analytics in a question-agnostic way.
I think your second approach is best, but if you're certain you're going to have a lot of scale concerns, one thing that has worked for me in the past is a hybrid approach:
Create detailed response tables to store per-question responses as you've described in 2. This data would generally not be directly queried from your application, but would be used for generating summary data for reporting tables. You'd probably also want to implement some form of archiving or expunging for this data.
Also create the responses table from 1 if necessary. This can be used whenever users want to see a simple table for results.
For any analytics that need to be done for reporting purposes, schedule jobs to create additional summary data based on the data from 1.
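A scheduled summary job of that kind might look roughly like this (table names are hypothetical, shown with SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE detailed_response (user_id INTEGER, question_id INTEGER, answer TEXT);
CREATE TABLE response_summary  (question_id INTEGER, answer TEXT, n INTEGER,
                                PRIMARY KEY (question_id, answer));
""")
conn.executemany("INSERT INTO detailed_response VALUES (?, ?, ?)",
                 [(1, 1, 'A'), (2, 1, 'A'), (3, 1, 'B')])

def refresh_summary(conn):
    # Rebuild the reporting table from the detailed per-question responses;
    # the app reads response_summary, never the detailed table.
    conn.execute("DELETE FROM response_summary")
    conn.execute("""
        INSERT INTO response_summary
        SELECT question_id, answer, COUNT(*)
        FROM detailed_response
        GROUP BY question_id, answer
    """)

refresh_summary(conn)
summary = conn.execute(
    "SELECT answer, n FROM response_summary WHERE question_id = 1 ORDER BY answer"
).fetchall()
```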
This is absolutely a lot more work to implement, so I really wouldn't advise this unless you know for certain that this table is going to run into massive scale concerns.

The second approach is best.
If you want to normalize it further, you could create a table for question types.
The simple things to do are:
Place the database and log on their own disk, not all on C as default
Create the database as large as needed so you do not have pauses while the database grows
We have had log tables in SQL Server with tens of millions of rows.

No 2 looks fine.
For a table with only 4 columns it shouldn't be a problem, even with a good few million rows. Of course this can depend on which database you are using. If it's something like SQL Server, then it would be no problem.
You'd probably want to create an index on the QuestionID field, on the tblAnswer table.
Of course, you need to specify what Database you are using as well as estimated volumes.
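For example, with SQLite (the same CREATE INDEX statement works in most databases; the index name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblAnswer (AnswerID INTEGER PRIMARY KEY, "
             "UserID INTEGER, QuestionID INTEGER, Answer TEXT)")
# Index the foreign-key column that per-question lookups filter on.
conn.execute("CREATE INDEX idx_answer_question ON tblAnswer (QuestionID)")

# SQLite's query planner should now report an index search, not a full scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM tblAnswer WHERE QuestionID = ?", (1,)
).fetchall()
uses_index = any("idx_answer_question" in row[-1] for row in plan)
```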

You may choose to store the whole form as a JSON string.
Not sure about your requirement, but this approach would work in some circumstances.
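A minimal sketch of that approach (table and column names are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey_submission "
             "(id INTEGER PRIMARY KEY, user_id INTEGER, form TEXT)")

# The whole submitted form goes in as one JSON document.
form = {"q1": "Great course", "q2": "B", "q3": ["A", "C"]}
conn.execute("INSERT INTO survey_submission (user_id, form) VALUES (?, ?)",
             (42, json.dumps(form)))

# Reading it back means deserialising; per-question SQL queries get harder,
# which is the trade-off of this approach.
raw = conn.execute(
    "SELECT form FROM survey_submission WHERE user_id = 42").fetchone()[0]
restored = json.loads(raw)
```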

Looks pretty complete for a simple survey. Don't forget to add a table for 'open values', where a customer can provide their opinion via a textbox. Link that table with a foreign key to your answer and place indexes on all your relational columns for performance.

Number 2 is correct. Use the correct design until and unless you detect a performance problem. Most RDBMS will not have a problem with a narrow but very long table.

Having a large Answer table, in and of itself, is not a problem. As long as the indexes and constraints are well defined you should be fine. Your second schema looks good to me.

Given the proper indexes, your second solution is normalized and good for a traditional relational database system.
I don't know how huge "huge" is, but it should hold a couple of million answers without a problem.

Related

Adding columns versus adding rows - which offers better performance?

Searched and searched. Not sure how to use EXPLAIN/ANALYZE to answer this without constructing really large test tables, and I don't have the means or time to pursue that. Certainly someone can confidently answer this likely simple question for me and save me hours of testing to find out.
I have a table which looks something like this:
id | destination_id | key | value | json_profile_data | deleted_bool | deleted_timestamp
The key and value were the original use of the table, but we recently began storing json arrays instead and now the key/value fields are unused. I want to add 3 new bits of data to this record id. My instinct is to make new columns in each row for the 3 new fields, but my associate wants to use the key/value cols to add the information using the same destination_id.
MY proposal means less rows in the table and looks like this:
id | destination_id | key | value | json_profile_data | claim_code | claim_date | claim_approved_bool | deleted_bool | deleted_timestamp
HIS solution is to add new rows, using the key/value cols to insert the three new bits of info with the same destination_id as their parent row on these new rows.
id | destination_id | null | null | json_profile_data | deleted_bool | deleted_timestamp
id | destination_id | claim_code | value | null | deleted_bool | deleted_timestamp
id | destination_id | claim_date | value | null | deleted_bool | deleted_timestamp
id | destination_id | claim_approved_bool | value | null | deleted_bool | deleted_timestamp
His solution makes 4 rows per destination_id, mine makes 3 new columns on existing row for a given destination_id.
Which is more performant for selects against this table? Or does it matter? I hope I have written this in a way where its clear. Let me know if more elaboration is needed.
As with most things database, the answer is "it depends". In particular, it mostly depends on what result set needs to be returned, what predicates are specified, what indexes are available, cardinality, etc.
With that said, in general, adding columns to the table would likely give better performance than adding rows.
A more important issue (I think) is the design of the insert/update/delete operations.
The original table looks like an implementation of an EAV (Entity Attribute Value) model; queries against EAV can get notoriously complicated when the results need to be "pivoted", and returned in a different format; or when we have predicates on multiple attributes.
To stick with the EAV model, we'd add rows to the table, and grind through the more complicated SQL that's required to work with that.
But if improved performance is the goal, we'd probably avoid EAV model entirely, and just store attributes as columns. That's the traditional relational database model: each row represents an "entity" (i.e. person, place, thing, concept or event that can be uniquely identified and we need to store information about), and each column represents an "attribute", a piece of information about the entity.
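To make the contrast concrete, here is a small sketch (SQLite; the key names follow the question, the table names are illustrative) showing the pivot an EAV layout forces, next to the plain read a column-per-attribute table allows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eav  (destination_id INTEGER, key TEXT, value TEXT);
CREATE TABLE wide (destination_id INTEGER PRIMARY KEY,
                   claim_code TEXT, claim_date TEXT, claim_approved TEXT);
""")
conn.executemany("INSERT INTO eav VALUES (1, ?, ?)",
                 [("claim_code", "X9"), ("claim_date", "2014-01-01"),
                  ("claim_approved_bool", "1")])
conn.execute("INSERT INTO wide VALUES (1, 'X9', '2014-01-01', '1')")

# EAV needs a pivot (conditional aggregation) to get one row per entity...
pivoted = conn.execute("""
    SELECT destination_id,
           MAX(CASE WHEN key = 'claim_code' THEN value END),
           MAX(CASE WHEN key = 'claim_date' THEN value END),
           MAX(CASE WHEN key = 'claim_approved_bool' THEN value END)
    FROM eav GROUP BY destination_id
""").fetchone()

# ...whereas the column-per-attribute table is a plain single-row read.
direct = conn.execute("SELECT * FROM wide WHERE destination_id = 1").fetchone()
```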
As you said, you'd have to try this with realistic volumes of data to see it empirically, but there's no question that the 'added columns' approach will be more performant. The other method will require four joins, which will almost certainly slow things down.
Your associate is suggesting EAV storage. Ample details in this related question on dba.SE:
Is there a name for this database structure?
The rest is for Postgres, only applicable to MySQL in parts.
You already have a json column, which is the obvious third solution to the problem, but neither of you seems to consider that? Maybe even just add to the json column in place (that's not what I would normally do, though). Actually, if you go that route, consider the new jsonb in the upcoming Postgres 9.4.
However, as long as we are talking about just those three columns (and not another new column every n weeks), your approach wins the performance bet in almost any possible aspect, and by a long shot, too. Additional columns are much cheaper, even if they are NULL most of the time, since NULL storage is very cheap:
Making sense of Postgres row sizes
Do nullable columns occupy additional space in PostgreSQL?
Storage size is a major contributor to performance.
Either method can be indexed. For EAV storage you can employ partial indexes. To optimize this, one needs to know typical queries, access patterns, requirements and priorities. Your approach is typically simpler to manage.
The obvious aspects where your approach would lose:
If there is a variable (yet unknown) number of new columns you need to add on the fly. That's much simpler with the EAV approach.
If you have lots of updates to only (one of) the new columns. That's cheaper with small separate rows.
Recent related answer discussing many columns in a table, with code for cross-tabulation, often needed for EAV storage:
SQL : Create a full record from 2 tables

Automatically numbering and referencing Sphinx tables

Is there a way to reference Sphinx tables the following way:
.. table Supertable
+--------+----+
|Foo |Bar |
+--------+----+
And then:
:table:`Supertable`
And magic! The problem is that there are certain tables that could be referenced throughout one document, and such linking could come in very useful.
On a side note the approach I've illustrated my question with doesn't work.
Also, as another part of the same question, is there a way to automatically number the tables? I'm pretty positive I've seen this somewhere, but it could be something manual. I mean like in:
Table 11: Consumption of peanut butter by the state..
This functionality, if available at all, is eluding me.
OK, I've found an answer to the first part of my question. Actually it's a no-brainer:
.. _table:
.. table:: Supertable
+--------+----+
|Foo |Bar |
+--------+----+
And then:
:ref:`table`
As for numbered tables, I have actually seen numbered figures, not tables, and that gets done in the LaTeX output. I looked around and haven't found any trace of automatically numbered tables in Sphinx. It would probably make a good feature request, but for now there seems to be no such feature.
PS: I've checked, and tables are actually also numbered in the LaTeX output. There is also a related problem discussed in this question.

Quiz database setup for SQL Server

I want to create a database, and I am trying to think ahead to the future regarding results retrieval and the time it takes to perform tasks. Basically, I am going to have a table that holds the answers that were given on the quiz. There are 48 questions. Is it better to have one long row with all of the answers given, with the column names after the question number? Or should I have one row per answer, with the question ID? Either way makes sense to me, but I am pretty new at this.
48 questions/columns are a lot. And what happens if you want to have only 12 or 50 questions tomorrow?
A design like
player_id | quiz_id | question_id | answer_id
will give you more flexibility in the future.
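A sketch of that design (the table name, types, and primary key are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE quiz_answer (
    player_id   INTEGER,
    quiz_id     INTEGER,
    question_id INTEGER,
    answer_id   INTEGER,
    PRIMARY KEY (player_id, quiz_id, question_id))""")

# 48 questions today, 12 or 50 tomorrow: just more or fewer rows, no ALTER TABLE.
conn.executemany("INSERT INTO quiz_answer VALUES (1, 1, ?, ?)",
                 [(q, q % 4) for q in range(1, 49)])
answered = conn.execute(
    "SELECT COUNT(*) FROM quiz_answer WHERE player_id = 1 AND quiz_id = 1"
).fetchone()[0]
```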

Would DB performance be better to store all responses from the user in the same row or multiple rows

I need to create a table to store a user's responses to a question, and they can have up to 12 responses. What table structure would work best? I have created 2 options, but if you have a better idea I am open to suggestions.
Table 1 (Store each answer in a new row)
UserId
QuestionId
Answer Number
Answer
Table 2(Store all answers in one row)
UserId
QuestionId
Answer1
Answer2
Answer3
Answer4
Answer5
Answer6
Answer7
Answer8
Answer9
Answer10
Answer11
Answer12
Giving each answer its own row would be better, so I would recommend going with your idea for Table 1. That way, if you want to raise the limit from 12 to, say, 20, you do not need to add a new column, and you can count responses more easily.
You don't want redundancy and unnecessary/unused columns. From a proper DB design point of view, you should definitely go with option one. It is more normalized, and will add value if you decide to scale it any time later.
I'd recommend neither design.
All answers in one row breaks first normal form.
I'd have a Question table, a User table, and an Answer table. A User could be given many Questions; there's one Answer per Question.
The answer is that option 2 will perform better, because you only need one I/O operation to retrieve all the answers. I once built a data warehouse with a similar "wide" design, and it performed amazingly well.
...but typically, performance shouldn't be the only consideration.
From a database design point of view, it's better to use one row per answer.
This is because:
adding columns (to cater for more answers) involves a schema change (much harder), but adding rows does not
rows are scalable (what if someone had 1000 answers? Are you going to add 1000 columns?)
queries are easier: if answers are stored in columns you must name each one explicitly, but with rows you name only the answer column and use SQL to pull everything together
Unless raw speed is your stand-out goal, prefer Table 1 (one row per answer) over Table 2 (one column per answer).
From a true performance perspective it depends (from a good database design perspective it's a no brainer, multiple rows is the way to go).
If all your answers fit within a single page and you're seeking that row using a clustered index, it is probably going to be slightly faster with solution 2. Your tree would have fewer leaves, making the search run over a smaller dataset. You also avoid the row multiplication that comes with a join.
Solution 1 will be a little faster if you have page splits. As long as the join column is indexed of course.
Though in the end, the minor performance increase you could get with option 2 over option 1 would probably be insignificant compared to the maintenance costs of bad design.
You should definitely store the answers as separate records.
If you store the answers in one record, you will have data (the answer number) in the field names, which breaks first normal form. This is a sign of a really bad database design.
With the answers in separate records it's easier to access the data. Consider for example that you want to get the last answer for each question and user. This is very easy if you have the answers as separate records, but very complicated if you have them in a single record.
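For instance, with one row per answer, the "last answer per user and question" query is a simple correlated subquery (a sketch in SQLite; table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE answer (user_id INTEGER, question_id INTEGER, "
             "answer_number INTEGER, answer TEXT)")
conn.executemany("INSERT INTO answer VALUES (?, ?, ?, ?)",
                 [(1, 7, 1, 'first'), (1, 7, 2, 'revised'), (2, 7, 1, 'only')])

# Last answer per user and question: trivial with one row per answer,
# painful if the 12 answers were 12 columns.
last = conn.execute("""
    SELECT user_id, question_id, answer
    FROM answer a
    WHERE answer_number = (SELECT MAX(answer_number) FROM answer b
                           WHERE b.user_id = a.user_id
                             AND b.question_id = a.question_id)
    ORDER BY user_id
""").fetchall()
```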
The first option would need to store the user-id multiple times too.
I would go for the second option, especially if you can put a hard limit on it such as 12.
This also requires only a single write operation for the database.
What are these 12 things ... months?

Is this a bad approach to database design?

I have to build an application for my university that will count as course credit for a class that lasts 1 month. In this application I need a way for users to save a Teacher Class Followup Evaluation, in which a person goes to the classroom, observes the teacher, and ticks certain columns.
An example would be:
Pedagogical Aspects:
Show order and follows class sequence: YES NO Observations
Gives clear examples: YES NO Observations
Involves students in discussion: YES NO Observations
If the user (the one evaluating) chooses YES, then nothing is written in Observations, but if he chooses NO, he has to write observations without fail.
How could I handle this in my database? I'm having doubts about over normalizing. :x Any suggestion would be welcome at this point before I move on with the project.
My plan as of now is to just have a big table called Followup that has all these 'aspects' with a BIT datatype in Microsoft SQL, plus a nullable observation field (e.g. ShowOrderSequenceObservation) for every aspect. O_O I feel dirty just thinking about it, so I turn to you, fellow developers. Thank you!
I would do something like this:
Table for the actual record - note that this is an anonymous recording from the student perspective
| record_id | question_id | YESNO | observation | teacher_id |
Table of questions.
| question_id | question_string |
Table of teachers:
| teacher_id | teacher_string |
In the general flow of things, I would also update the student table to note "has recorded" and insert the answers all in one transaction. This would preserve student anonymity yet also get the data in.
edit - I have no idea how I would ORM this thing. If I were developing it, I'd hack it out in 10-30 hours with Perl and direct SQL access. Most of the time would be spent beating on HTML formatting.
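The "NO requires an observation" rule from the question can even be enforced in the schema with a CHECK constraint (a sketch with SQLite and illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE followup_record (
    record_id   INTEGER PRIMARY KEY,
    question_id INTEGER,
    teacher_id  INTEGER,
    yesno       INTEGER NOT NULL,   -- 1 = YES, 0 = NO
    observation TEXT,
    -- a NO must come with an observation; a YES must not
    CHECK ((yesno = 1 AND observation IS NULL) OR
           (yesno = 0 AND observation IS NOT NULL)))""")

conn.execute("INSERT INTO followup_record VALUES (1, 1, 1, 1, NULL)")
try:
    # A NO without an observation violates the constraint.
    conn.execute("INSERT INTO followup_record VALUES (2, 2, 1, 0, NULL)")
    no_without_observation_rejected = False
except sqlite3.IntegrityError:
    no_without_observation_rejected = True
```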
Sounds like the age-old question of time vs. quality. A denormalised table would certainly be fast and easy, but a normalised one with category and question tables would allow flexibility. Your uni could use it for other things, allow new question types to be set up on the fly, etc., and it could get you a better grade.
If you think you can get what you want with a denormalised table, I'd go that way. It's not a production system and business needs aren't going to change in its lifetime. But if you want to push for the blue ribbon solution, I'd normalise it.
BTW, adding a < br > at the end of each option makes it more readable.
You know normalization isn't just for large enterprise-level databases (I know you know :). History has shown that if you don't normalize, you will get anomalies. Start with 5NF and 'optimize' from there, though I suspect you will find that optimization is not required.
I suspect the proposed design will not suit its intended purpose, e.g. data analysis. Try writing some typical SQL queries against it (e.g. the average length of Observations across all Pedagogical questions, then across all questions) and you will find it a pain: huge CASE statements, tables UNIONed many times over... it's likely you will end up writing VIEWs to normalise the data!