Adding columns versus adding rows - which offers better performance? - sql

Searched and searched. Not sure how to use Explain/Analyze to answer this, without constructing really large test tables and I don't have the means or time to pursue that. Certainly someone can confidently answer this likely simple question for me and save me hours of testing to find out.
I have a table which looks something like this:
id | destination_id | key | value | json_profile_data | deleted_bool | deleted_timestamp
The key and value were the original use of the table, but we recently began storing json arrays instead and now the key/value fields are unused. I want to add 3 new bits of data to this record id. My instinct is to make new columns in each row for the 3 new fields, but my associate wants to use the key/value cols to add the information using the same destination_id.
MY proposal means fewer rows in the table and looks like this:
id | destination_id | key | value | json_profile_data | claim_code | claim_date | claim_approved_bool | deleted_bool | deleted_timestamp
HIS solution is to add new rows, using the key/value cols to insert the three new bits of info with the same destination_id as their parent row on these new rows.
id | destination_id | null | null | json_profile_data | deleted_bool | deleted_timestamp
id | destination_id | claim_code | value | null | deleted_bool | deleted_timestamp
id | destination_id | claim_date | value | null | deleted_bool | deleted_timestamp
id | destination_id | claim_approved_bool | value | null | deleted_bool | deleted_timestamp
His solution makes 4 rows per destination_id, mine makes 3 new columns on existing row for a given destination_id.
Which is more performant for selects against this table? Or does it matter? I hope I have written this in a way that is clear. Let me know if more elaboration is needed.

As with most things database, the answer is "it depends". In particular, it mostly depends on what resultset needs to be returned, what predicates are specified, which indexes are available, cardinality, etc.
With that said, in general, adding columns to the table would likely give better performance than adding rows.
A more important issue (I think) is the design of the insert/update/delete operations.
The original table looks like an implementation of an EAV (Entity Attribute Value) model; queries against EAV can get notoriously complicated when the results need to be "pivoted", and returned in a different format; or when we have predicates on multiple attributes.
To stick with the EAV model, we'd add rows to the table, and grind through the more complicated SQL that's required to work with that.
But if improved performance is the goal, we'd probably avoid EAV model entirely, and just store attributes as columns. That's the traditional relational database model: each row represents an "entity" (i.e. person, place, thing, concept or event that can be uniquely identified and we need to store information about), and each column represents an "attribute", a piece of information about the entity.
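To make the difference concrete, here is a minimal sketch of both layouts from the question, using Python's built-in sqlite3 module as a stand-in for the real database. The table names profile_wide and profile_eav and the sample values are made up for illustration; only the column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide layout: the three claim attributes are ordinary columns.
CREATE TABLE profile_wide (
    id INTEGER PRIMARY KEY,
    destination_id INTEGER NOT NULL,
    json_profile_data TEXT,
    claim_code TEXT,
    claim_date TEXT,
    claim_approved_bool INTEGER
);
-- EAV layout: each claim attribute is an extra key/value row.
CREATE TABLE profile_eav (
    id INTEGER PRIMARY KEY,
    destination_id INTEGER NOT NULL,
    key TEXT,
    value TEXT,
    json_profile_data TEXT
);
INSERT INTO profile_wide VALUES (1, 10, '{}', 'ABC', '2014-06-01', 1);
INSERT INTO profile_eav (destination_id, key, value, json_profile_data) VALUES
    (10, NULL, NULL, '{}'),
    (10, 'claim_code', 'ABC', NULL),
    (10, 'claim_date', '2014-06-01', NULL),
    (10, 'claim_approved_bool', '1', NULL);
""")

# Wide layout: one row, plain column access.
wide_row = conn.execute(
    "SELECT claim_code, claim_date, claim_approved_bool "
    "FROM profile_wide WHERE destination_id = ?", (10,)
).fetchone()

# EAV layout: several rows have to be pivoted back into one record.
eav_row = conn.execute("""
    SELECT MAX(CASE WHEN key = 'claim_code' THEN value END),
           MAX(CASE WHEN key = 'claim_date' THEN value END),
           MAX(CASE WHEN key = 'claim_approved_bool' THEN value END)
    FROM profile_eav
    WHERE destination_id = ?
    GROUP BY destination_id
""", (10,)).fetchone()
```

Note that the EAV query reads four rows and returns every value as text, while the wide query reads one row and keeps the column types.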

As you said, you'd have to try this with realistic volumes of data to see it empirically, but there's no question that the 'added columns' approach will be more performant. The other method will require four joins, which will almost certainly slow things down.

Your associate is suggesting EAV storage. Ample details in this related question on dba.SE:
Is there a name for this database structure?
The rest is for Postgres, only applicable to MySQL in parts.
You already have a json column, which is the obvious third solution to the problem, but neither of you seems to consider that? Maybe even just adding to the json column in place (That's not what I would normally do, though.) Actually, if you go that route consider the new jsonb in the upcoming Postgres 9.4.
However, as long as we are talking about just those three columns (and not another new column every n weeks), your approach wins the performance bet in almost any possible aspect, and by a long shot. Additional columns are much cheaper, even if they are NULL most of the time, since NULL storage is very cheap:
Making sense of Postgres row sizes
Do nullable columns occupy additional space in PostgreSQL?
Storage size is a major contributor to performance.
Either method can be indexed. For EAV storage you can employ partial indexes. To optimize this, one needs to know typical queries, access patterns, requirements and priorities. Your approach is typically simpler to manage.
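As a rough illustration of a partial index on EAV rows, here is a sketch using Python's sqlite3 module, which accepts the same CREATE INDEX ... WHERE form as Postgres. The table name profile_eav, the index name and the sample rows are assumptions for the example, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profile_eav (
    id INTEGER PRIMARY KEY,
    destination_id INTEGER NOT NULL,
    key TEXT,
    value TEXT
);
INSERT INTO profile_eav (destination_id, key, value) VALUES
    (10, 'claim_code', 'ABC'),
    (10, 'claim_date', '2014-06-01'),
    (11, 'claim_code', 'XYZ');
-- Partial index: only rows holding a claim_code are indexed.
CREATE INDEX eav_claim_code_idx ON profile_eav (value)
    WHERE key = 'claim_code';
""")

# The query repeats the index predicate so the planner can consider the index.
query = ("SELECT destination_id FROM profile_eav "
         "WHERE key = 'claim_code' AND value = ?")
matches = conn.execute(query, ("ABC",)).fetchall()

# Inspect the plan to see whether the partial index is actually chosen.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("ABC",)).fetchall()
```

Whether the planner picks the index depends on the engine and the data, so check the plan against your real tables.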
The obvious aspects where your approach would lose:
If there is a variable (yet unknown) number of new columns you need to add on the fly. That's much simpler with the EAV approach.
If you have lots of updates to only (one of) the new columns. That's cheaper with small separate rows.
Recent related answer discussing many columns in a table, with code for cross-tabulation, often needed for EAV storage:
SQL : Create a full record from 2 tables


Can I create an Sql Server index that includes a summary of some child rows?

Is it possible in Sql Server 2016 to create some indexed View (or equiv) that summarizes a simple Parent / Child relationship.
e.g.:
Parent: Vehicles
Children: Parts (the parts that make up the car)
Children: Workers (the people who helped build the car)
And results like this..
e.g.:
Car | Parts | Workers
Car_A | 1111 | 4
Car_B | 123412 | 54
Car_C | 0 | 0
I guess if I was to write this as a SQL query (which I'm assuming is not indexed then and has to 'calculate' the entire answer):
SELECT a.CarId
FROM Cars a
LEFT OUTER JOIN Parts b ON a.CarId = b.CarId
LEFT OUTER JOIN Workers c ON a.CarId = c.CarId
Now the reason I'm after an 'INDEXED' view is that I'm assuming that the calculations are stored on disk so I only have to do a scan of the index to get the results, instead of having to calculate the results on execution.
The example above is also contrived (for this question). In reality I have a larger table structure, etc. etc.
I do understand that Indexed Views can't have specific keywords, like COUNT (as mentioned by #brentozar in a blog post)
Am I looking at this problem the wrong way? I don't really want to rely on SQL caching and would rather be a bit more proactive here.
(Note: Maybe I've been doing many Map/Reduce queries in another NoSql db :) )
Sure you can make it.
The data is persisted to disk so the querying is faster, but the downside is that the index (or the view) has to be maintained on every change in the underlying tables/views data.
More on it, read this.
EDIT: You've just got to use COUNT_BIG() instead of COUNT() to get the parts and workers numbers.
EDIT2: After the author's comment below I see I've made a mistake because I haven't understood it's all about the joins. I can't see a great solution to this considering you're aggregating over the vehicles...
The only solution I can see is if you're willing to have a certain "lag" in the data - then you could dump the data into a table. You could have a job that calculates the select in a temporary table/table variable and then within a transaction truncates a table and dumps the new aggregated data into it. This way your data would be accessible fast from a table and the exchange of old vs. new data would be fast, but as I mentioned it wouldn't be completely real-time data.
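A minimal sketch of that refresh job, written with Python's sqlite3 module rather than SQL Server, so DELETE stands in for TRUNCATE. The CarSummary table, the refresh_summary function and the sample rows are hypothetical. The counts use correlated subqueries instead of joining Parts and Workers in one query, because two LEFT JOINs on the same car would multiply the rows and inflate both counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cars    (CarId TEXT PRIMARY KEY);
CREATE TABLE Parts   (PartId INTEGER PRIMARY KEY, CarId TEXT REFERENCES Cars(CarId));
CREATE TABLE Workers (WorkerId INTEGER PRIMARY KEY, CarId TEXT REFERENCES Cars(CarId));
-- Hypothetical summary table that readers query instead of the base tables.
CREATE TABLE CarSummary (CarId TEXT PRIMARY KEY, Parts INTEGER, Workers INTEGER);
INSERT INTO Cars VALUES ('Car_A'), ('Car_B'), ('Car_C');
INSERT INTO Parts (CarId) VALUES ('Car_A'), ('Car_A'), ('Car_B');
INSERT INTO Workers (CarId) VALUES ('Car_A'), ('Car_B'), ('Car_B');
""")

def refresh_summary(conn):
    """Rebuild the summary in one transaction so readers never see it half-filled."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM CarSummary")
        conn.execute("""
            INSERT INTO CarSummary (CarId, Parts, Workers)
            SELECT c.CarId,
                   (SELECT COUNT(*) FROM Parts p WHERE p.CarId = c.CarId),
                   (SELECT COUNT(*) FROM Workers w WHERE w.CarId = c.CarId)
            FROM Cars c
        """)

refresh_summary(conn)
summary = conn.execute(
    "SELECT CarId, Parts, Workers FROM CarSummary ORDER BY CarId"
).fetchall()
```

How often the job runs decides how stale the numbers can get.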

SQL table design: one or multiple line per entity? [duplicate]

I was wondering if you have a website with a dozen different types of listings (Shops, Restaurants, Clubs, Hotels, Events) that require different fields, is there a benefit of creating a table with columns defined like so
Example Shop:
shop_id | name | X | Y | city | district | area | metro | station | address | phone | email | website | opening_hours
Or a more abstract approach similar to this:
object_id | name
---------------
1 | Messy Joe's
2 | Bate's Motel
type_id | name
---------------
1 | hotel
2 | restaurant
object_id | type_id
---------------
1 | 2
2 | 1
field_id | name | field_type
---------------
1 | address | text
2 | opening_hours | date
3 | speciality | text
type_id | field_id
---------------
1 | 1
1 | 2
2 | 1
2 | 3
object_id | field_id | value
1 | 1 | 1st street....
1 | 3 | English Cuisine
Of course it can be more abstract if value's are predefined (Example: specialties could have their own list)
If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins.
But I don't know if this affects the performance, executing these 'more complex' queries.
I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this.
Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.
Pre-requisite Resolution
Labels
In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.
What you have posted as FixedTables, is Unnormalised. Fair enough, it may be an attempt at Third Normal form, but in fact it is a flat file, Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.
The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns or to resolve dependencies) and Nulls, it is a performance hog in many ways, and prevents concurrency.
In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.
It cannot be said to be denormalised "for performance", because being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide it, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).
Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).
I am agreeing with the thrust of OMG Ponies, but not their labels and definitions: rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.
Wikipedia
The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.
However, in order to progress, without that reference posing a hindrance, let me say this.
The definition of 3NF is stable, and has not changed.
There is a lot of confusion of the NFs between 3NF and 5NF. The truth is that this is an area that progressed over the last 15 years; and many orgs, academics as well as vendors with their products with limitations, jumped to create a new "Normal Form" to validate their offerings. All serving commercial interests and academically unsound. 3NF in its original untampered state intended and guaranteed certain attributes.
The sum total is, 5NF is today, what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in-between, some of which are identified in Wikipedia, and even that in confusing terms.
Fifth Normal Form
Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course a true Relational Model is pre-requisite, strong keys, etc. Fifth Normal Form is, since we are skipping the Fourth:
Third Normal Form
which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
and to no other non-key columns
Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
Sixth Normal Form
The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
Technically, it is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has a value
EAV vs Sixth Normal Form
All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.
The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.
Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:
6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).
Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
Discussion
Now, we can start the discussion.
"Of course it can be more abstract if value's are predefined (Example: specialities could have their own list)"
Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.
"If I take the abstract approach it can be very flexible, but queries will be more complex with a lot of joins. But I don't know if this affects the performance, executing these 'more complex' queries."
Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.
EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.
The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.
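To give an idea of what generating the SQL from a catalogue might look like, here is a rough sketch in Python with sqlite3. It is not the set of procs described above. It uses the field table from the question as the catalogue, and the object_field table name, the build_pivot_sql function and the shortened sample address are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE object (object_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- The catalogue: every attribute is defined here as data, not in code.
CREATE TABLE field  (field_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE object_field (
    object_id INTEGER NOT NULL REFERENCES object(object_id),
    field_id  INTEGER NOT NULL REFERENCES field(field_id),
    value     TEXT NOT NULL,
    PRIMARY KEY (object_id, field_id)
);
INSERT INTO object VALUES (1, 'Messy Joe''s'), (2, 'Bate''s Motel');
INSERT INTO field VALUES (1, 'address'), (2, 'opening_hours'), (3, 'speciality');
INSERT INTO object_field VALUES (1, 1, '1st street'), (1, 3, 'English Cuisine');
""")

def build_pivot_sql(conn):
    """Generate a one-row-per-object SELECT from the field catalogue."""
    fields = conn.execute(
        "SELECT field_id, name FROM field ORDER BY field_id").fetchall()
    cols = ",\n       ".join(
        "MAX(CASE WHEN v.field_id = {} THEN v.value END) AS \"{}\"".format(
            int(fid), name.replace('"', '""'))  # escape quotes in identifiers
        for fid, name in fields
    )
    return (
        "SELECT o.object_id, o.name,\n       " + cols + "\n"
        "FROM object o\n"
        "LEFT JOIN object_field v ON v.object_id = o.object_id\n"
        "GROUP BY o.object_id, o.name\n"
        "ORDER BY o.object_id"
    )

rows = conn.execute(build_pivot_sql(conn)).fetchall()
```

Adding a row to field changes the generated SELECT without anyone editing a query by hand. The generated text could equally be stored as a View; missing values show up as NULL here only because of the outer join in the generated query, not because they are stored.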
It is a common myth that Joins somehow have a cost. Totally false.
The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
The issue is the size of tables being joined, not the cost of the Join between those same tables.
Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side [PK = parent FK + something]), is instantaneous
Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
None of it has to do with Join cost.
Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
"I would be interested to know what are the up and downsides of both methods. I can just imagine for myself, but I don't have the experience to confirm this."
5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.
The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most cases, because the change control in place is quite onerous.
Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
EAV (which is what you have posted), allows columns to be added without DDL changes. That is the single reason people choose it. (code handling the new column does not count, because that is an imperative). If implemented well, it will not affect existing code; if not, it will.
But you need EAV-capable developers.
When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.
Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.
In your question, you have presented at least two major issues at the same time. Those two issues are E-A-V and gen-spec.
First, let's talk about E-A-V. Your last table (object_id, field_id, value) is essentially an E-A-V. There is an upside to E-A-V and a downside to E-A-V. The upside is that the structure is so generic that it can accommodate almost any body of data describing almost any subject matter. That means that you can proceed to design and implementation with no data analysis and no understanding of the subject matter, and not worry about wrong assumptions. The down side is that at retrieval time, you have to do the data analysis that you skipped over before building the data base, in order to come up with queries that mean anything. This is much more serious than just retrieval efficiency. But you are also going to have terrible problems with retrieval efficiency. There are only two ways to learn about this pitfall: live through it or read about it from those who have. I recommend the reading.
Second, you have a gen-spec case. Your table (object_id, type_id) captures a gen-spec (generalization-specialization) pattern, along with the related tables. If I had to generalize between hotels and restaurants, I might call it something like "public accommodations" or "venues". But I'm not sure I understand your case, and you may be driving for something even more general than those two names suggest. After all, you've included "events" in your list, and an event is not a type of venue in my mind.
I've referred other people to readings on gen-spec and the relational model in previous responses.
When two tables are very similar, when should they be combined?
But I hesitate to send you off in the same direction, because it's not clear to me that you want to come up with a relational model of the data before building your database. A relational model of a body of data and an E-A-V model of the same data are almost totally at odds with each other. It seems to me you have to make that choice before you even explore how to express gen-spec in the relational model of data.
When you start to require a large number of different entities (or even before...), a nosql solution would be vastly simpler than either choice.
Just store each entity/record with the exact fields you require.
{
"id": 1,
"type":"Restaurant",
"name":"Messy Joe",
"address":"1 Main St.",
"tags":["asian","fusion","casual"]
}
The "abstract" approach is better known as "Normalization", looks like 3rd Normal Form (3NF).
The other one is called "Denormalized", and can be a valid performance option... when you've encountered speed issues using the Normalized approach, not before.
How do you have the listings represented in code? I'd guess Listing as a supertype, with Shop, Restaurant, etc. as subtypes?
Assuming so, this is a case of how to map subtypes to a relational database. There are generally three choices:
Option 1: single table per subtype, with common attributes repeated in each table (name, id, etc).
Option 2: single table for all objects (your single table approach)
Option 3: table for the supertype and one for each subtype
There's no universally correct solution. My preference is generally to start with option 3; it provides an intuitive structure to work with, is pretty well normalised and can easily be extended. It means a single join for retrieving each instance, but RDBMS are well optimised for doing joins so it doesn't really cause performance problems in practice.
Option 1 can be more performant for queries (no joins) but causes problems if other tables need to refer to all supertype instances (proliferation of foreign keys).
Option 2 appears at first sight to be the most performant, although 2 caveats: (1) It's not resilient to change. If you add a new subtype (and so different attributes) you'll need to change the table structure and migrate it. (2) It can be less efficient than it seems. Because the table population is sparse, some DBs don't store it particularly efficiently. As a consequence it can be less efficient than option 3, since the query engine can do joins faster than it can search bloated sparse table spaces.
Which to choose really comes down to knowing details of your problem. I'd suggest reading up a bit on the options: this article is a good place to start.
hth
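For what it's worth, option 3 might look roughly like the sketch below, using Python's sqlite3 module. The listing, restaurant and hotel table names and the room_count column are invented for the example; the sample names are borrowed from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Supertype: attributes every listing has.
CREATE TABLE listing (
    listing_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT
);
-- Subtypes: share the supertype's key and add their own attributes.
CREATE TABLE restaurant (
    listing_id INTEGER PRIMARY KEY REFERENCES listing(listing_id),
    speciality TEXT
);
CREATE TABLE hotel (
    listing_id INTEGER PRIMARY KEY REFERENCES listing(listing_id),
    room_count INTEGER  -- hypothetical hotel-only attribute
);
INSERT INTO listing VALUES (1, 'Messy Joe''s', NULL), (2, 'Bate''s Motel', NULL);
INSERT INTO restaurant VALUES (1, 'English Cuisine');
INSERT INTO hotel VALUES (2, 12);
""")

# Retrieving a full restaurant record takes a single join on the shared key.
restaurants = conn.execute("""
    SELECT l.listing_id, l.name, r.speciality
    FROM listing l
    JOIN restaurant r ON r.listing_id = l.listing_id
""").fetchall()
```

Other tables (reviews, photos, and so on) can then reference listing.listing_id once, whatever the subtype.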

In SQL can I return a tables with a varying number of columns

I have a somewhat more complicated scenario, but I think it should be possible.
I have a large SPROC whose result is a set of characteristics for a set of persons.
So the Table would look something like this:
Property | Client1 | Client2 | Client3
-----------------------------------------------------------
Sex | M | F | M
Age | 67 | 56 | 67
Income | Low | Mid | Low
It's built using cursors, iterating over different datasets.
The problem I am facing is that there is a varying number of Clients and Properties, so an equally valid result over different input sets might be:
Property | Client1 | Client2
-------------------------------------------
Sex | M | F
Age | 67 | 56
Weight | 122 | 122
The different number of properties is easy, those are just extra rows.
My problem is that I need to declare a temporary table with a varying number of columns.
There could be 2 clients or 100. Every client is guaranteed to have every property ultimately listed.
What SQL structure would satisfy this and how can I declare it and insert things into it?
I can't just flip the columns and rows either because there is a variable number of each.
First, you should consider normalizing your design like so:
Create Table ClientAttributes
(
ClientId ....
, Sex Char(1)...
, Age int...
, Income...
)
Second, the SQL language in general is not geared for dynamic column generation. To achieve this, you have to build the SQL statement at runtime (a.k.a. dynamic SQL) as a string. It is best to do this in the middle-tier or a reporting engine rather than in T-SQL.
Third, an infinitely flexible design where you have no idea as to the number or types of properties or instances isn't a design at all. Each table represents a collection of entities with known attributes. They are not wads of arbitrary data. The attributes (columns) need to be known at design time or you risk a Cthulhu design where nothing but chaos reigns.
Will Hughes' suggestion is to use an Entity-Attribute-Value design (a.k.a. EAV). It takes a tremendous amount of discipline to do an EAV correctly. It only works if it is treated solely as a wad of data. I.e., no developer can ever filter on any specific attribute (i.e. hard code a query looking for a specific attribute name), ever calculate on the values and you will never be able to ensure consistency amongst specific values. If you can maintain that discipline, then an arbitrary wad of data as a limited portion to a larger design can work. However, once you decide to hard code a query looking for a specific attribute you have gone down the dark path and forever will it dominate your destiny. As the database grows, the performance will degrade significantly. You will have no data integrity (e.g. you have two attributes called "Age" and "Client Age" with some values being integers and others being text). EAVs can be a nightmare to maintain. It is far better for maintenance, reporting, querying etc. to have a normalized design.
This looks like you're using Pivoting on the server*.
While you CAN do this, it's going to be pretty awkward to deal with - and I don't know of any ORM that will work that well for you (if you go down that path).
Instead of Pivoting, how about arranging it more like(*2):
Client | Property | Value
------------------------------
Client 1| Sex | M
Client 1| Age | 67
This way you can still do that per-Client pivoting in your application for display purposes.
(* FWIW: You do know SQL Server 2005+ has PIVOT Commands, right? To save you using cursors.)
(*2 This is just a possible approach. It's hacky, and Thomas's recommendation to normalise your schema is a much better and (likely) more efficient option.)
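As a sketch of doing the pivot in the application rather than on the server: store the long Client | Property | Value form and turn it into a grid in code, so no table with a varying number of columns ever has to be declared. This uses Python's sqlite3 module; the client_property table and the pivot function are made-up names, and the sample values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE client_property (
    client   TEXT NOT NULL,
    property TEXT NOT NULL,
    value    TEXT NOT NULL,
    PRIMARY KEY (client, property)
);
INSERT INTO client_property VALUES
    ('Client1', 'Sex', 'M'), ('Client1', 'Age', '67'),
    ('Client2', 'Sex', 'F'), ('Client2', 'Age', '56');
""")

def pivot(conn):
    """Return (header, rows) with one column per client, however many there are."""
    clients = [c for (c,) in conn.execute(
        "SELECT DISTINCT client FROM client_property ORDER BY client")]
    table = {}
    for client, prop, value in conn.execute(
            "SELECT client, property, value FROM client_property"):
        table.setdefault(prop, {})[client] = value
    header = ["Property"] + clients
    rows = [[prop] + [table[prop].get(c) for c in clients]
            for prop in sorted(table)]
    return header, rows

header, rows = pivot(conn)
```

Adding a third client or a new property only adds rows to client_property; the grid grows by itself.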

Is this a bad approach to database design?

I have to build an application for my university that will count as course credit for a Class that lasts 1 month. In this application I have to have a way for users to save a Teacher Class Followup Evaluation, in which a person goes to the classroom, observes the teacher and ticks certain columns.
An example would be:
Pedagogical Aspects:
Show order and follows class sequence: YES NO Observations
Gives clear examples: YES NO Observations
Involves students in discussion: YES NO Observations
If the user (the one evaluating) chooses YES, then nothing is written in Observations, but if he chooses NO, he has to write observations without fail.
How could I handle this in my database? I'm having doubts about over normalizing. :x Any suggestion would be welcome at this point before I move on with the project.
My plan as of now is to just have a big table called Followup that has all these 'aspects' with a BIT datatype in Microsoft SQL and have a ShowOrderSequenceObservation field for every aspect that can be null. O_O I feel dirty just thinking about so I turn to you fellow developers. Thank you!
I would do something like this:
Table for the actual record - note that this is an anonymous recording from the student perspective
| record_id | question_id | YESNO | observation | teacher_id |
Table of questions.
| question_id | question_string |
Table of teachers:
| teacher_id | teacher_string |
In the general flow of things, I would also update the student table to note "has recorded" and insert the answers all in one transaction. This would preserve student anonymity yet also get the data in.
edit - I have no idea how I would ORM this thing. If I was developing it, I'd hack it out in 10-30 hours with perl and direct sql access. Most of the time would be spent beating on HTML formatting.
Sounds like the age old question of time vs quality. A denormalised table would certainly be fast and easy, but a normalised one with category and question tables would allow flexibility. Your uni could use it for other things, allow new question types to be set up on the fly etc, and it could get you a better grade.
If you think you can get what you want with a denormalised table, I'd go that way. It's not a production system and business needs aren't going to change in its lifetime. But if you want to push for the blue ribbon solution, I'd normalise it.
BTW, adding a < br > at the end of each option makes it more readable.
You know normalization isn't just for large enterprise level database (I know you know :). History has shown that if you don't normalize you will get anomalies. Start with 5NF and 'optimize' from there, though I suspect you will find that optimization is not required.
I suspect the proposed design will not suit its intended purpose e.g. data analysis. Try writing some typical SQL queries against it (e.g. average length of Observations across all Pedagogical questions then across all questions) and you will find it a pain: huge CASE statements, tables UNIONed many times over, ... it's likely you will end up writing VIEWs to normalise the data!
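For comparison, a minimal sketch of a normalised version, using Python's sqlite3 module. The question and followup_answer tables, the category column and the sample rows are assumptions for illustration. A CHECK constraint carries the "NO requires an observation" rule, and the average-length query becomes a plain GROUP BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE question (
    question_id INTEGER PRIMARY KEY,
    category TEXT NOT NULL,
    question_string TEXT NOT NULL
);
CREATE TABLE followup_answer (
    followup_id INTEGER NOT NULL,
    question_id INTEGER NOT NULL REFERENCES question(question_id),
    yes_no INTEGER NOT NULL CHECK (yes_no IN (0, 1)),
    observation TEXT,
    PRIMARY KEY (followup_id, question_id),
    -- A NO answer (0) must come with a non-empty observation.
    CHECK (yes_no = 1
           OR (observation IS NOT NULL AND length(trim(observation)) > 0))
);
INSERT INTO question VALUES
    (1, 'Pedagogical Aspects', 'Shows order and follows class sequence'),
    (2, 'Pedagogical Aspects', 'Gives clear examples');
INSERT INTO followup_answer VALUES (1, 1, 1, NULL), (1, 2, 0, 'Too abstract');
""")

def rejects_no_without_observation(conn):
    """Try to store a NO answer with no observation; True if the DB refuses."""
    try:
        with conn:
            conn.execute("INSERT INTO followup_answer VALUES (2, 1, 0, NULL)")
    except sqlite3.IntegrityError:
        return True
    return False

# Average observation length per category, with no CASE or UNION needed.
avg_len = conn.execute("""
    SELECT q.category, AVG(length(a.observation))
    FROM followup_answer a
    JOIN question q ON q.question_id = a.question_id
    WHERE a.observation IS NOT NULL
    GROUP BY q.category
""").fetchall()
```

Adding a new aspect to the evaluation is then an INSERT into question rather than two new columns.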

Database design for a survey [closed]

I need to create a survey where answers are stored in a database. I'm just wondering what would be the best way to implement this in the database, specifically the tables required. The survey contains different types of questions. For example: text fields for comments, multiple choice questions, and possibly questions that could contain more than one answer (i.e. check all that apply).
I've come up with two possible solutions:
1. Create a giant table which contains the answers for each survey submission. Each column would correspond to an answer from the survey, i.e. SurveyID, Answer1, Answer2, Answer3. I don't think this is the best way since there are a lot of questions in this survey and it doesn't seem very flexible if the survey is to change.
2. The other thing I thought of was creating a Question table and Answer table. The question table would contain all the questions for the survey. The answer table would contain individual answers from the survey, each row linked to a question.
A simple example:
tblSurvey: SurveyID
tblQuestion: QuestionID, SurveyID, QuestionType, Question
tblAnswer: AnswerID, UserID, QuestionID, Answer
tblUser: UserID, UserName
My problem with this is that there could be tons of answers which would make the Answer table pretty huge. I'm not sure that's so great when it comes to performance.
I'd appreciate any ideas and suggestions.
I think that your model #2 is fine, however you can take a look at the more complex model which stores questions and pre-made answers (offered answers) and allows them to be re-used in different surveys.
- One survey can have many questions; one question can be (re)used in many surveys.
- One (pre-made) answer can be offered for many questions. One question can have many answers offered. A question can have different answers offered in different surveys. An answer can be offered to different questions in different surveys. There is a default "Other" answer, if a person chooses other, her answer is recorded into Answer.OtherText.
- One person can participate in many surveys, one person can answer specific question in a survey only once.
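The three rules above can be sketched as a schema. This is only one possible reading, written for SQLite through Python's `sqlite3` module so it can be run as-is; every table and column name here (`survey_question`, `survey_question_answer`, `other_text`, and so on) is an assumption made for illustration and is not taken from the author's diagram or create script.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE survey         (survey_id         INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE question       (question_id       INTEGER PRIMARY KEY, text  TEXT NOT NULL);
CREATE TABLE offered_answer (offered_answer_id INTEGER PRIMARY KEY, text  TEXT NOT NULL);
CREATE TABLE person         (person_id         INTEGER PRIMARY KEY, name  TEXT NOT NULL);

-- Many-to-many: a survey has many questions, a question is reusable across surveys.
CREATE TABLE survey_question (
    survey_id   INTEGER NOT NULL REFERENCES survey(survey_id),
    question_id INTEGER NOT NULL REFERENCES question(question_id),
    PRIMARY KEY (survey_id, question_id)
);

-- Which pre-made answers are offered for a question in a particular survey.
CREATE TABLE survey_question_answer (
    survey_id         INTEGER NOT NULL,
    question_id       INTEGER NOT NULL,
    offered_answer_id INTEGER NOT NULL REFERENCES offered_answer(offered_answer_id),
    PRIMARY KEY (survey_id, question_id, offered_answer_id),
    FOREIGN KEY (survey_id, question_id)
        REFERENCES survey_question(survey_id, question_id)
);

-- The primary key allows one answer per person per question per survey.
CREATE TABLE answer (
    person_id         INTEGER NOT NULL REFERENCES person(person_id),
    survey_id         INTEGER NOT NULL,
    question_id       INTEGER NOT NULL,
    offered_answer_id INTEGER NOT NULL,
    other_text        TEXT,  -- filled in only when the "Other" answer is chosen
    PRIMARY KEY (person_id, survey_id, question_id),
    FOREIGN KEY (survey_id, question_id, offered_answer_id)
        REFERENCES survey_question_answer(survey_id, question_id, offered_answer_id)
);
""")

conn.execute("INSERT INTO survey VALUES (1, 'Course feedback')")
conn.execute("INSERT INTO question VALUES (1, 'How was the pace?')")
conn.execute("INSERT INTO offered_answer VALUES (1, 'Too fast'), (2, 'Other')")
conn.execute("INSERT INTO person VALUES (1, 'alice')")
conn.execute("INSERT INTO survey_question VALUES (1, 1)")
conn.execute("INSERT INTO survey_question_answer VALUES (1, 1, 1), (1, 1, 2)")
conn.execute("INSERT INTO answer VALUES (1, 1, 1, 2, 'Uneven')")

# A second answer from the same person to the same question is rejected.
try:
    conn.execute("INSERT INTO answer VALUES (1, 1, 1, 1, NULL)")
    answered_twice = True
except sqlite3.IntegrityError:
    answered_twice = False
```

Note that the one-answer-per-question key would need to be widened to include `offered_answer_id` if "check all that apply" questions have to be supported.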
My design is shown below.
The latest create script is at https://gist.github.com/durrantm/1e618164fd4acf91e372
The script and the mysql workbench.mwb file are also available at
https://github.com/durrantm/survey
Definitely option #2. I also think you might have an oversight in the current schema; you might want another table:
+-----------+
| tblSurvey |
|-----------|
| SurveyId  |
+-----------+
+--------------+
| tblQuestion  |
|--------------|
| QuestionID   |
| SurveyID     |
| QuestionType |
| Question     |
+--------------+
+--------------+
| tblAnswer    |
|--------------|
| AnswerID     |
| QuestionID   |
| Answer       |
+--------------+
+------------------+
| tblUsersAnswer   |
|------------------|
| UserAnswerID     |
| AnswerID         |
| UserID           |
| Response         |
+------------------+
+-----------+
| tblUser   |
|-----------|
| UserID    |
| UserName  |
+-----------+
Each question is going to probably have a set number of answers which the user can select from, then the actual responses are going to be tracked in another table.
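As a rough sketch of how those five tables fit together, here they are created in SQLite via Python's `sqlite3` module, with a tally of responses per offered answer. The column types, the `NOT NULL` constraints, and the sample rows are assumptions; only the table and column names come from the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblSurvey (SurveyID INTEGER PRIMARY KEY);
CREATE TABLE tblQuestion (
    QuestionID   INTEGER PRIMARY KEY,
    SurveyID     INTEGER NOT NULL REFERENCES tblSurvey(SurveyID),
    QuestionType TEXT NOT NULL,
    Question     TEXT NOT NULL
);
-- The answers a user can pick from for a question.
CREATE TABLE tblAnswer (
    AnswerID   INTEGER PRIMARY KEY,
    QuestionID INTEGER NOT NULL REFERENCES tblQuestion(QuestionID),
    Answer     TEXT NOT NULL
);
CREATE TABLE tblUser (UserID INTEGER PRIMARY KEY, UserName TEXT NOT NULL);
-- What each user actually chose (Response holds optional free text).
CREATE TABLE tblUsersAnswer (
    UserAnswerID INTEGER PRIMARY KEY,
    AnswerID     INTEGER NOT NULL REFERENCES tblAnswer(AnswerID),
    UserID       INTEGER NOT NULL REFERENCES tblUser(UserID),
    Response     TEXT
);

-- Sample data, invented for the example.
INSERT INTO tblSurvey VALUES (1);
INSERT INTO tblQuestion VALUES (1, 1, 'multiple_choice', 'Favourite colour?');
INSERT INTO tblAnswer VALUES (1, 1, 'Red'), (2, 1, 'Blue');
INSERT INTO tblUser VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
INSERT INTO tblUsersAnswer (AnswerID, UserID) VALUES (1, 1), (2, 2), (2, 3);
""")

# Count how many users picked each offered answer for question 1.
# LEFT JOIN keeps answers that nobody selected, with a count of 0.
tally = conn.execute("""
    SELECT a.Answer, COUNT(ua.UserAnswerID)
    FROM tblAnswer a
    LEFT JOIN tblUsersAnswer ua ON ua.AnswerID = a.AnswerID
    WHERE a.QuestionID = 1
    GROUP BY a.AnswerID
    ORDER BY a.AnswerID
""").fetchall()
```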
Databases are designed to store a lot of data, and most scale very well. There is no real need to use a lesser normal form simply to save on space anymore.
As a general rule, modifying schema based on something that a user could change (such as adding a question to a survey) should be considered fairly smelly. There are cases where it can be appropriate, particularly when dealing with large amounts of data, but know what you're getting into before you dive in. Having just a "responses" table for each survey means that adding or removing questions is potentially very costly, and it's very difficult to do analytics in a question-agnostic way.
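To illustrate what "question-agnostic" can mean in practice, here is a small sketch against the Question/Answer layout from the question, using SQLite via Python's `sqlite3`. The column types and sample rows are assumptions. The point is that one query covers every question, and adding a question is an `INSERT` rather than a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblQuestion (QuestionID INTEGER PRIMARY KEY, Question TEXT NOT NULL);
CREATE TABLE tblAnswer (
    AnswerID   INTEGER PRIMARY KEY,
    UserID     INTEGER NOT NULL,
    QuestionID INTEGER NOT NULL REFERENCES tblQuestion(QuestionID),
    Answer     TEXT NOT NULL
);
INSERT INTO tblQuestion VALUES (1, 'Comments?'), (2, 'Suggestions?');
INSERT INTO tblAnswer (UserID, QuestionID, Answer) VALUES
    (1, 1, 'Good'), (2, 1, 'Great'), (1, 2, 'More examples');
""")

# One query for all questions: response count and average answer length.
PER_QUESTION = """
    SELECT q.Question, COUNT(a.AnswerID), AVG(LENGTH(a.Answer))
    FROM tblQuestion q
    LEFT JOIN tblAnswer a ON a.QuestionID = q.QuestionID
    GROUP BY q.QuestionID
    ORDER BY q.QuestionID
"""
before = conn.execute(PER_QUESTION).fetchall()

# Adding a question needs no ALTER TABLE, and the query above is unchanged.
conn.execute("INSERT INTO tblQuestion VALUES (3, 'Would you recommend us?')")
conn.execute("INSERT INTO tblAnswer (UserID, QuestionID, Answer) VALUES (2, 3, 'Yes')")
after = conn.execute(PER_QUESTION).fetchall()
```

With one column per question, the equivalent report would need a new column and an edited query for every question added.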
I think your second approach is best, but if you're certain you're going to have a lot of scale concerns, one thing that has worked for me in the past is a hybrid approach:
1. Create detailed response tables to store per-question responses as you've described in your option 2. This data would generally not be directly queried from your application, but would be used for generating summary data for reporting tables. You'd probably also want to implement some form of archiving or expunging for this data.
2. Also create the responses table from your option 1 if necessary. This can be used whenever users want to see a simple table for results.
3. For any analytics that need to be done for reporting purposes, schedule jobs to create additional summary data based on the detailed data from step 1.
This is absolutely a lot more work to implement, so I really wouldn't advise this unless you know for certain that this table is going to run into massive scale concerns.
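A minimal sketch of the summary-job part of that hybrid, assuming SQLite via Python's `sqlite3`. The reporting table `rptAnswerCount` and the function `refresh_summary` are hypothetical names invented here; a real setup would call the refresh from whatever scheduler is available and would probably refresh incrementally rather than rebuild everything.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detailed per-question responses (the layout from option 2).
CREATE TABLE tblAnswer (
    AnswerID   INTEGER PRIMARY KEY,
    UserID     INTEGER NOT NULL,
    QuestionID INTEGER NOT NULL,
    Answer     TEXT NOT NULL
);
-- Hypothetical reporting table: one row per (question, answer) with a count.
CREATE TABLE rptAnswerCount (
    QuestionID INTEGER NOT NULL,
    Answer     TEXT NOT NULL,
    Responses  INTEGER NOT NULL,
    PRIMARY KEY (QuestionID, Answer)
);
""")

def refresh_summary(conn):
    """Rebuild the summary from the detail rows; intended for a scheduled job."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute("DELETE FROM rptAnswerCount")
        conn.execute("""
            INSERT INTO rptAnswerCount (QuestionID, Answer, Responses)
            SELECT QuestionID, Answer, COUNT(*)
            FROM tblAnswer
            GROUP BY QuestionID, Answer
        """)

conn.executemany(
    "INSERT INTO tblAnswer (UserID, QuestionID, Answer) VALUES (?, ?, ?)",
    [(1, 1, "Yes"), (2, 1, "Yes"), (3, 1, "No")],
)
conn.commit()

refresh_summary(conn)
summary = conn.execute(
    "SELECT QuestionID, Answer, Responses FROM rptAnswerCount ORDER BY Answer"
).fetchall()
```

Reports then read from the small summary table instead of scanning the detail rows.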
The second approach is best.
If you want to normalize it further, you could create a table for question types.
The simple things to do are:
- Place the database and log on their own disk, not all on C as default
- Create the database as large as needed so you do not have pauses while the database grows
We have had log tables in SQL Server with tens of millions of rows.
No 2 looks fine.
For a table with only 4 columns it shouldn't be a problem, even with a good few million rows. Of course this can depend on what database you are using. If it's something like SQL Server then it would be no problem.
You'd probably want to create an index on the QuestionID field, on the tblAnswer table.
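A quick way to check that such an index is actually picked up is to compare the query plan before and after creating it. The sketch below uses SQLite through Python's `sqlite3` purely because it is easy to run; the index name `idx_tblAnswer_QuestionID`, the row counts, and the sample query are assumptions, and other databases have their own plan-inspection commands and plan formats.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tblAnswer (
        AnswerID   INTEGER PRIMARY KEY,
        UserID     INTEGER NOT NULL,
        QuestionID INTEGER NOT NULL,
        Answer     TEXT NOT NULL
    )
""")
# 1000 users answering 20 questions each = 20,000 rows of made-up data.
conn.executemany(
    "INSERT INTO tblAnswer (UserID, QuestionID, Answer) VALUES (?, ?, ?)",
    [(u, q, "answer") for u in range(1000) for q in range(20)],
)

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT Answer FROM tblAnswer WHERE QuestionID = 7"
plan_before = plan(query)  # without an index: a scan of the whole table

conn.execute("CREATE INDEX idx_tblAnswer_QuestionID ON tblAnswer (QuestionID)")
plan_after = plan(query)   # with the index: a search that names the index
```

Printing `plan_before` and `plan_after` shows the difference; the exact wording of the plan text varies between SQLite versions.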
Of course, you need to specify what Database you are using as well as estimated volumes.
You may choose to store the whole form as a JSON string.
Not sure about your requirement, but this approach would work in some circumstances.
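For comparison, a minimal sketch of the JSON-string approach, again with SQLite and Python's standard `json` module. The table `tblSubmission`, its columns, and the form fields are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tblSubmission (
        SubmissionID INTEGER PRIMARY KEY,
        UserID       INTEGER NOT NULL,
        FormJson     TEXT NOT NULL  -- the whole submitted form, serialized
    )
""")

# A made-up submission mixing free text, a single choice and a multi-select.
form = {"comments": "Nice course", "pace": "Too fast", "topics": ["SQL", "Indexes"]}
conn.execute(
    "INSERT INTO tblSubmission (UserID, FormJson) VALUES (?, ?)",
    (1, json.dumps(form)),
)

# Reading it back yields the same structure the application stored.
(stored,) = conn.execute(
    "SELECT FormJson FROM tblSubmission WHERE UserID = 1"
).fetchone()
restored = json.loads(stored)
```

The trade-off is that per-question reporting then has to be done in application code or with the database's JSON functions, where available, rather than with plain joins and GROUP BY.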
Looks pretty complete for a simple survey. Don't forget to add a table for 'open values', where a customer can provide his opinion via a textbox. Link that table with a foreign key to your answer and place indexes on all your relational columns for performance.
Number 2 is correct. Use the correct design until and unless you detect a performance problem. Most RDBMS will not have a problem with a narrow but very long table.
Having a large Answer table, in and of itself, is not a problem. As long as the indexes and constraints are well defined you should be fine. Your second schema looks good to me.
Given the proper index your second solution is normalized and good for a traditional relational database system.
I don't know how huge is huge, but it should hold a couple of million answers without a problem.