questionnaire time spent mysql - sql

Say I have a table where I store questions.
Now I would like to track how much time people spend per question on average, and how many came up with the right solution.
Should I store the time spent per question in table_questions itself or in a different table?
Should I store whether it was answered correctly in table_questions or in a separate table, maybe together with the time spent?
I am hesitating for two reasons. First, I would rather not let users run update queries against my questions table. Second, separating the time spent and "answered correctly" into a different table seems weird to me, because they are inherent to the question.
Does anyone with normalization talent (unlike me) know what would be a good approach?

My suggestion:
Don't name tables TABLE_QUESTIONS or TABLE_USERS or anything similar, unless you have a good reason, and I cannot think of one at the moment. Just call them QUESTIONS and USERS.
If you actually have a USERS table, and you care who answers correctly (I cannot tell based on the wording of the question), then I think you should also have a USER_QUESTIONS table. The tables might look like this:
QUESTIONS
---------
Question_Id
Question_Descr
USERS
-----
User_Id
User_Name
USER_QUESTIONS
--------------
Question_Id
User_Id
Answer
Grade
StartTime
EndTime
Then questions (and only questions) go in their own table, and users (and only users) go in their own table. But when a user answers a question, it goes in the mixed table.
You have a many-to-many relationship between users and questions, and creating an intermediate table like this is the normal way of resolving that.
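As a rough illustration of the intermediate table, here is a minimal sketch using Python's built-in sqlite3 module. The lower-case column names, the 0/1 encoding of Grade, and the use of Unix seconds for StartTime and EndTime are assumptions made for the example, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (
    question_id    INTEGER PRIMARY KEY,
    question_descr TEXT NOT NULL
);
CREATE TABLE users (
    user_id   INTEGER PRIMARY KEY,
    user_name TEXT NOT NULL
);
CREATE TABLE user_questions (
    question_id INTEGER NOT NULL REFERENCES questions(question_id),
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    answer      TEXT,
    grade       INTEGER NOT NULL,  -- assumption: 1 = correct, 0 = wrong
    start_time  INTEGER NOT NULL,  -- assumption: Unix seconds
    end_time    INTEGER NOT NULL,
    PRIMARY KEY (question_id, user_id)
);
INSERT INTO questions VALUES (1, 'What is 2 + 2?');
INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
INSERT INTO user_questions VALUES
    (1, 1, '4', 1, 100, 110),
    (1, 2, '5', 0, 200, 230);
""")

# Average time spent and number of correct answers, per question.
stats = conn.execute("""
    SELECT question_id,
           AVG(end_time - start_time) AS avg_seconds,
           SUM(grade)                 AS correct_answers
    FROM user_questions
    GROUP BY question_id
""").fetchall()
print(stats)
```

The per-question statistics are derived from USER_QUESTIONS when needed, so the application never has to update the QUESTIONS table on behalf of a user.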

Related

Storing a multiple choice quiz in a database - deciding the schema

I am trying to implement a multiple choice quiz and want to store all my questions and answers in a SQLite database. I will have many questions, and for each question there will be 2 or more possible answers to display.
My question is, how should I store the questions and answers in a database? I have two ideas for a schema (primary key in bold)
1. (many to many)
questions (questionID:int , questionString:String, correctAnswerID:int)
answers (answerID:int , answerString:String)
questions_and_answers (questionID, answerID)
2.
questions (questionID:int, questionString:String, correctAnswerID:int)
answers (answerID:int, answerString:String, questionID:int foreign key)
I'm not sure which one is better, or if there is another way?
Maybe questions_and_answers would get very large and cause long retrieval times and memory problems? Then again, I assume questions_and_answers would be indexed on the primary keys. In the second schema, answers would be indexed on answerID and not questionID, meaning search times would go up because the whole table would have to be searched?
There may be ~10,000 - 20,000 answers. (the quiz may be run on a mobile device and questions will need to be shown "instantly")
Note: I don't expect there to be much overlap of answers between questions. I wouldn't think the amount of overlap would mean less data being stored, considering the extra space required by the questions_and_answers table.
Your second schema is the better one, because it models the actual domain: each question has a set of answers. Even if you can "compress" the data by storing duplicate answers once, it does not match the actual domain.
Down the road you'll want to edit answers. With schema 1, that means first searching if that answer already exists. If it does exist, you then would have to check if any questions still rely on the old answer. If it did not exist, you would still have to check if any other questions relied on that answer, and then either edit that answer in place or create a new answer.
Schema 1 just makes life really hard.
To answer your index questions, you would need to add an index on questionId. Once you have that index, looking up answers for a question should scale.
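A minimal sketch of the second schema with that index, using Python's built-in sqlite3 module. The index name idx_answers_question is made up for the example, and the correctAnswerID column is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (
    questionID     INTEGER PRIMARY KEY,
    questionString TEXT NOT NULL
);
CREATE TABLE answers (
    answerID     INTEGER PRIMARY KEY,
    answerString TEXT NOT NULL,
    questionID   INTEGER NOT NULL REFERENCES questions(questionID)
);
-- Without this index, finding the answers of one question scans the whole table.
CREATE INDEX idx_answers_question ON answers(questionID);

INSERT INTO questions VALUES (1, 'Capital of France?');
INSERT INTO answers VALUES (1, 'Paris', 1), (2, 'Rome', 1), (3, 'Oslo', 1);
""")

# Ask SQLite how it would run the lookup; the last column describes each step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT answerString FROM answers WHERE questionID = ?",
    (1,),
).fetchall()
uses_index = any("idx_answers_question" in row[-1] for row in plan)

options = [
    row[0]
    for row in conn.execute(
        "SELECT answerString FROM answers WHERE questionID = ? ORDER BY answerID",
        (1,),
    )
]
print(uses_index, options)
```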
Now, on a completely different note, why use a database for this? Consider storing them as simple documents in a standard format like json. Anytime you query a question, you will almost always want the answers, and vice versa. Instead of executing multiple queries, you can load the entire document in one step.
If you then find you need more advanced storage (queries, redundancy, etc) you can move to a document database like MongoDB or CouchDB.
There seems to be a circular reference here: the questionID column is a foreign key in the answers table, and the correctAnswerID column is a foreign key in the questions table.
It's better to add a bit column to the answers table to mark the correct answer and remove the correctAnswerID column.

Survey Data Model - How to avoid EAV and excessive denormalization?

My database skills are mediocre at best and I have to design a data model for survey data. I have spent some thoughts on this and right now I feel that I am stuck between some kind of EAV model and a design involving hundreds of tables, each with hundreds of columns (and thousands of records). There must be a better way to do this and I hope that the wise folks on this forum can help me.
My question is: how should I model the answers to survey questions in an RDBMS? Using SQL Server is mandatory. So alternative data storage systems should be excluded from this discussion. (Sure, some should and will be evaluated, but not here please.) I don't need a solution for the entire data model, for now I'm only interested in the Answers part.
I have already searched various forums, but I couldn't really find a solution. If it has already been given elsewhere, please excuse me and provide me with a link so I can read it up.
Some assumptions about the data I have to deal with:
Each survey consists of 1 to n questionnaires
Each questionnaire consists of 100-2,000 questions (please ignore that 2,000 questions really sound like a lot to answer...)
Questions can be of various types: multiple-choice, free text, a number (like age, income, percentages, ...)
Each survey involves 10-200 countries (These are not the respondents. The respondents are actually people in the countries.)
Depending on the type of questionnaire, each questionnaire is answered by 100-20,000 respondents per country.
A country can adapt the questionnaires for a survey, i.e. add, remove or edit questions
The data for one country is gathered in a separate database in that country. There is no possibility for online integration from the start.
The data for all countries has to be integrated later. This means for example, if a country has deleted a question, that data must somehow be derived from what they sent in order to achieve a uniform design across all countries
I will have to write the integration and cleaning software, which will need to work with every country's data
In the end the data needs to be exported to flat files, one rectangular grid per country and questionnaire.
I have already discussed this topic with people from various backgrounds and have not come to a good solution yet. I mainly got two kinds of opinions.
The domain experts, who are used to working with flat files (spreadsheet-style) for data processing and analysis, vote for a denormalized structure with loads of tables and columns as I described above (1 table per country and questionnaire). This sounds terrible to me: I learned that wide tables are to be avoided, it will be annoying to determine which columns are actually in a table when working with it, the database will become cluttered with hundreds of tables (or I will even need to set up multiple databases, each with a similar yet slightly different design), etc.
O-O programmers vote for a strongly "normalized" design, which would effectively lead to a central table containing all the answers from all respondents to all questions. This table would need to contain either a column of type sql_variant or multiple answer columns with different types to store answers of different types (multiple choice, free text, ...). The former would essentially be an EAV model. I tend to follow Joe Celko here, who strongly discourages its use (he calls it OTLT or "One True Lookup Table"). The latter would imply that each row would contain null cells for the non-applicable types by design.
Another alternative I could think of would be to create one table per answer type, i.e., one for multiple-choice questions, one for free text questions, etc. That's not so generic: I think it would lead to a lot of unions and joins, and I would have to add a table whenever a new answer type is invented.
Sorry for boring you with all this text and thank you for your input!
Cheers,
Alex
PS: I asked the same question here: http://www.eggheadcafe.com/community/aspnet/13/10242616/survey-data-model--how-to-avoid-eav-and-excessive-denormalization.aspx
Well imgur is down so i'll post the pic later.
I think this is completely feasible within a relational model. I've built a CDM to show how I would do this.
Outbound
It takes 4 entities to define a Country's Survey. Some Parent Survey, the country and a list of questions. Your questions have an internal relationship so when one country "edits" a question, you can track both the question asked by the country and the question it came from. The other thing you need is a Possible Answer entity/table. Each question may have an associated list of possible answers (multiple choice or ranges etc). Those 4 should completely define the "OUTBOUND" side of this.
Inbound
The "INBOUND" side is just 2 new entities, The Respondent and the answer. The respondent is straightforward, just the demographics of that person if you know them and here you can include a relationship back to country. Each respondent answered the survey in a given country. (Person may be 1:n with Respondent if the person travels or has dual citizenship)
The answer is basic; either it is one of the choices listed in the list of Possible Answers or it is provided. Don't get all caught up in the fact that the answer may be a number, date, etc just yet. Either it's a FK or a string of characters.
Reporting
A report is a join over all of these... You'll choose a country and a survey, get the list of questions and answers.
Answer Complexity
Depends on where you want to do your calculations. If you used a Varchar2(4000) column for your user-provided answers, you could add an attribute to question to describe the datatype of the answer. Q: Age? DT: Integer Between (0 and 130). Then your integration layer can do the validation instead of the database enforcing it. Or you can have 4 columns, one for number, date, character and CLOB. And your integration layer will determine the column to use. When you report those answers out, you'll just select all four columns with Coalesce().
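The four-column variant could look roughly like the sketch below, written with Python's built-in sqlite3 module. The table and column names are assumptions for illustration, and SQLite's TEXT type stands in for the date, character and CLOB columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answers (
    respondent_id INTEGER NOT NULL,
    question_id   INTEGER NOT NULL,
    num_value     NUMERIC,  -- filled for numeric questions
    date_value    TEXT,     -- filled for date questions (ISO 8601 string)
    char_value    TEXT,     -- filled for short text answers
    clob_value    TEXT,     -- filled for long free-text answers
    PRIMARY KEY (respondent_id, question_id)
);
-- The integration layer decides which single column each answer goes into.
INSERT INTO answers (respondent_id, question_id, num_value)  VALUES (1, 1, 30);
INSERT INTO answers (respondent_id, question_id, char_value) VALUES (1, 2, 'Page');
INSERT INTO answers (respondent_id, question_id, date_value) VALUES (1, 3, '2011-05-01');
""")

# Reporting: collapse the four typed columns into one display value.
rows = conn.execute("""
    SELECT question_id,
           COALESCE(CAST(num_value AS TEXT), date_value, char_value, clob_value)
    FROM answers
    ORDER BY question_id
""").fetchall()
print(rows)
```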
Is this an EAV because there's a slight ambiguity to the datatype of "Answer"?
No, it's not.
An EAV model breaks down an Entity into a list of attributes,
like so:
Entity Attribute Value
1 Fname Stephanie
1 Lname Page
1 Age 30
Because you see the Answer column of the Survey schema holding both words and numbers, like the Value column does here, you think that defines an EAV. It does not. Likewise, if I added 3 datatype columns to this model, it would not stop being an EAV.
I soooo hate it when
I've had people tell me that the query I'm tuning has to go "as fast as possible". Ok, so give me a billion dollars and 30 years. "Wait, a Billion what?" "As much as", "as fast as" aren't requirements. You can validate anything you want in a database... build a shedload of Before triggers, voila! Validation galore.
What's the datatype of an age column? Or Birthdate column? Depends on what your data source is. Some older records may only have Month and Year, or just year, or 'around' or 'circa' some year. You couldn't have just a number column and do 'as much validation as possible'. and NUMBER(2) may be BETTER validation than just NUMBER. So now you'll have NUMBER(1), NUMBER(2), NUMBER... to have "as much as".
Where I think you are getting tripped up
Think of this as a Conceptual Data Model, not a Physical one. In those terms Survey is an entity. Is Question an entity or just an attribute of Survey? If you built One table PER you're clearly saying that Question is just an Attribute of Survey, and storing them vertically makes this an EAV. What this model shows is that Question is actually another entity. There is a relationship between Questions, e.g. 'a country [can] edit questions'. There was the original question and the edited one. Each question has a collection of possible answers. And the most important thing is that they are all questions. In an EAV I call fname, lname, bdate, age, major, salary, etc., all very disparate things, just attributes. In this case we're not including the name of the agency that originated the survey, the date it was issued, the date it is due back, etc., as questions.
Let me put this another way. You're Fedex. You want to store timestamps for certain events. Each time a package enters or leaves a facility or vehicle. Time on the picking up truck, time off the truck and into the first facility, time out of that facility and onto a plane, etc. Do you store them Horizontally? How do you know the number of hops in advance? If you store them vertically does that automatically make it an EAV? And if so why.
You're a weather company getting temps from stations around the country. Let's say the sensors are designed to send a reading when the temperature changes +/- a full degree. If you store sensor_ID|timestamp|temp in a Reading table, is that an EAV? Each reading isn't an attribute of the sensor; the readings are themselves entities which belong to a collection/series.
One thing that vertical storage of answers has in common with an EAV is the difficulty of performing analytic queries. Getting a list of all the people who answered TRUE to questions 5 and 10 but FALSE to 6 and 11 would be very difficult when done vertically. Maybe that's why you see this as an EAV. If you want to do that, you need a different storage. The relational storage of the questions and answers isn't the best reporting database. Let's go back to the Fedex example. It's not simple to do "transit" time reporting when the rows are vertical.
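To give a feel for what such a query involves over vertical rows, here is one possible formulation using conditional aggregation, sketched with Python's built-in sqlite3 module. The answers table, its columns and the 'TRUE'/'FALSE' encoding are assumptions for the example; it also relies on there being at most one row per respondent and question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE answers (
    respondent_id INTEGER NOT NULL,
    question_id   INTEGER NOT NULL,
    answer        TEXT NOT NULL,
    PRIMARY KEY (respondent_id, question_id)
);
INSERT INTO answers VALUES
    (1, 5, 'TRUE'), (1, 6, 'FALSE'), (1, 10, 'TRUE'), (1, 11, 'FALSE'),
    (2, 5, 'TRUE'), (2, 6, 'TRUE'),  (2, 10, 'TRUE'), (2, 11, 'FALSE');
""")

# Count, per respondent, how many of the four conditions hold; keep those with all four.
matching = [
    row[0]
    for row in conn.execute("""
        SELECT respondent_id
        FROM answers
        WHERE question_id IN (5, 6, 10, 11)
        GROUP BY respondent_id
        HAVING SUM(CASE
                     WHEN question_id IN (5, 10) AND answer = 'TRUE'  THEN 1
                     WHEN question_id IN (6, 11) AND answer = 'FALSE' THEN 1
                     ELSE 0
                   END) = 4
        ORDER BY respondent_id
    """)
]
print(matching)
```

Every additional condition means another branch in the CASE expression and a different target count, which is part of why a separate reporting structure tends to be easier for this kind of analysis.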
This sounds like you are wrestling with a common problem: how to use a hammer to fasten a screw.
Both alternatives you listed are bad, each for different reasons. But that's because you are trying to stuff your particular data model into a relational database system. A good approach would be to look beyond the relational database at some other database/storage systems, try a couple out, and find the best fit for your project.
I have tried the EAV model and gave up because it was far too complex, and I am afraid to try the multi-tables model with a relational database system. The easiest solution I have found with a relational database is: store each complete response as a single CLOB, serialized into JSON or YAML (or something else lightweight), in a responses table.
create table responses (
id uuid primary key,
questionnaire_id uuid references questionnaires (id),
data text
);
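A minimal sketch of the serialized-response approach, using Python's built-in sqlite3, json and uuid modules. SQLite has no uuid type, so the ids are stored as text here, and the keys q1, q2, q3 in the sample response are made up for illustration.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questionnaires (id TEXT PRIMARY KEY);
CREATE TABLE responses (
    id               TEXT PRIMARY KEY,
    questionnaire_id TEXT REFERENCES questionnaires (id),
    data             TEXT  -- the whole response, serialized as JSON
);
""")

questionnaire_id = str(uuid.uuid4())
conn.execute("INSERT INTO questionnaires VALUES (?)", (questionnaire_id,))

# One complete response: a number, a free-text answer and a multiple-choice answer.
answers = {"q1": 30, "q2": "free text", "q3": ["a", "c"]}
conn.execute(
    "INSERT INTO responses VALUES (?, ?, ?)",
    (str(uuid.uuid4()), questionnaire_id, json.dumps(answers)),
)

# Reading it back is one query plus one deserialization step.
(stored,) = conn.execute(
    "SELECT data FROM responses WHERE questionnaire_id = ?",
    (questionnaire_id,),
).fetchone()
loaded = json.loads(stored)
print(loaded)
```

The trade-off is that the database cannot validate or index individual answers; that work moves into the application.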
If I were using SQL Server (Express will be OK), then I would do this:
Table with list of questions, flags for type (bit), a required flag (bit), the correct answer if one exists, etc.
Table with list of countries
Table linking countries and questions (some countries may not get some questions)
Table for answers with columns for the question(s) and an xml column for the optional questions, including those which are added
If you are not versed in shredding XML, then use sparse columns for all the optional questions. I do not recall the exact limit on the number of sparse columns in a table, but I believe it is around 30,000. SQL Server internally stores sparse columns as XML and will shred it when one selects the column, and yes, it can be indexed.
The diagram below shows a design created with SQL Server. The column AL_A4 will hold the answer to QL_Id = 4 and is of type sparse. That QL_Id in the QuestionList table is not flagged as required, which tells you to make the corresponding column in AnswerList sparse.
Since countries will add questions create QuestionListCustom, QuestiontoCountryCustom and AnswerListCustom tables and add the information from the custom questions.
I am sure there are other ways to design the storage, this is the way I would turn in the homework, if this is not homework then you surely work for the UN.
Have you considered not reinventing the wheel? There are open source survey applications already built. Even if they don't meet your needs, download a few and check out their data models.

How to model table hierarchy

I'm trying to make an application about formula 1. I have three tables: Team, Driver, And Race Results. I'm thinking about three options (and maybe I'm missing more):
Have a derived table Driver_Team. Have a Driver_TeamId in that table. Use that Driver_TeamId in the Race Results table. This seems to solve most of the queries I think I am going to use, but feels awkward and I haven't seen it anywhere.
Have Driver.DriverId and Team.TeamId in the Race Results table. This has the problem of not being able to add extra information. I don't know yet what information, maybe the date of the start of joining a new team. Then I would need a junction table (because that information is not Race Result related).
The last one: Have a junction table Driver_Team, but have only the Driver.DriverId as Foreign Key in the Race Results table. The problem is that queries like "How many points did team x get in season y/several seasons" become really, really horrible.
Am I missing another solution? If yes, please tell me! :-) Otherwise, which of these solutions seems the best?
Thanks!
Your first option gets my vote. I'd also suggest adding a Race table (to hold data such as track, date, conditions, etc.), and make Race_Results the combination of Driver_Team and Race.
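A sketch of that first option with an added Race table, using Python's built-in sqlite3 module. All table and column names, and the sample points, are assumptions for illustration. It shows that the "points per team per season" query stays short when Race_Results references Driver_Team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team   (team_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE driver (driver_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE driver_team (
    driver_team_id INTEGER PRIMARY KEY,
    driver_id      INTEGER NOT NULL REFERENCES driver(driver_id),
    team_id        INTEGER NOT NULL REFERENCES team(team_id),
    join_date      TEXT  -- extra facts about the association live here
);
CREATE TABLE race (race_id INTEGER PRIMARY KEY, season INTEGER NOT NULL, venue TEXT);
CREATE TABLE race_results (
    race_id        INTEGER NOT NULL REFERENCES race(race_id),
    driver_team_id INTEGER NOT NULL REFERENCES driver_team(driver_team_id),
    points         INTEGER NOT NULL,
    PRIMARY KEY (race_id, driver_team_id)
);
INSERT INTO team   VALUES (1, 'Red'), (2, 'Blue');
INSERT INTO driver VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO driver_team VALUES (1, 1, 1, '2010-01-01'), (2, 2, 1, '2010-01-01'), (3, 3, 2, '2010-01-01');
INSERT INTO race VALUES (1, 2010, 'Track 1'), (2, 2010, 'Track 2');
INSERT INTO race_results VALUES (1, 1, 25), (1, 2, 18), (1, 3, 15), (2, 1, 18), (2, 3, 25);
""")

# Points per team in one season: two joins from the results to the team.
totals = conn.execute("""
    SELECT t.name, SUM(rr.points) AS total
    FROM race_results rr
    JOIN race r         ON r.race_id = rr.race_id
    JOIN driver_team dt ON dt.driver_team_id = rr.driver_team_id
    JOIN team t         ON t.team_id = dt.team_id
    WHERE r.season = ?
    GROUP BY t.team_id, t.name
    ORDER BY total DESC
""", (2010,)).fetchall()
print(totals)
```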
I suggest the following:
RaceResult - Driver - DriverTeam - Team
Where RaceResult contains race_date, DriverTeam contains ( driver_id, team_id, team_join_date and team_leave_date ). Then you would be able to get all the info you're asking about in your question, even though the queries may be complicated.
Just brainstorming, one object model may look like this. Note the conspicuous lack of an "id" field on RaceResult, as the finishing position acts perfectly as a natural key (one driver per finishing position). Of course, there may be lots of other options as well.
Team:
id
name
Driver:
id
name
team_id
Race:
id
venue
date
RaceResults:
position
driver_id
race_id
For the kind of queries you're talking about, I think DriverId and TeamId should both be in RaceResults. If you want to store additional information about an association between a driver and a team, then that should be placed in a separate table. This appears to create a little bit of redundancy, since the driver/team pair in the race table will be limited by the employment dates in the DriverTeam table, but given the complexities of contracts and schedules, I think it may end up being not especially redundant.
I like the way you are planning the DB to support your queries. I have run into way too much OOP thinking in DB design over the years!
If you only store DriverId and TeamId in the RaceResults table, then you cannot associate a driver to a team without a RaceResult.

Database structure for voting system with up- and down votes

I am going to create a voting system for a web application and wonder what the best way would be to store the votes in the (SQL) database.
The voting system is similar to the one on StackOverflow. I am now pondering whether I should store the up and down votes in different tables. That way it is easier to count all up votes or all down votes. On the other hand, I would have to query two tables to find all votes for a user or a voted item.
An alternative would be one table with a boolean field that specifies if this vote is an up or down vote. But I guess counting up or down votes is quite slow (when you have a lot of votes), and an index on a boolean field (as far as I know) does not make a lot of sense.
How would you create the database structure? One or two tables?
Following the discussion in the comments, we found the solution that best fits Zardoz's needs.
He does not want to count votes on every request, and he needs as much detail as possible. So the solution is a mix of both:
Add an integer field to the table being voted on to store the vote count (make sure it cannot overflow).
Create additional tables to log the votes (user, post, date, up/down, etc.)
I would recommend using triggers to automatically update the vote count field when a vote is inserted, deleted or updated in the log table.
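A minimal sketch of the counter-plus-log design with triggers, using Python's built-in sqlite3 module. The table names, the trigger names and the +1/-1 encoding of the vote are assumptions for illustration; trigger syntax differs between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (
    id         INTEGER PRIMARY KEY,
    vote_count INTEGER NOT NULL DEFAULT 0  -- cached tally, maintained by triggers
);
CREATE TABLE votes (
    user_id INTEGER NOT NULL,
    post_id INTEGER NOT NULL REFERENCES posts(id),
    value   INTEGER NOT NULL CHECK (value IN (1, -1)),  -- assumption: +1 up, -1 down
    PRIMARY KEY (user_id, post_id)                       -- one vote per user and post
);
CREATE TRIGGER votes_after_insert AFTER INSERT ON votes
BEGIN
    UPDATE posts SET vote_count = vote_count + NEW.value WHERE id = NEW.post_id;
END;
CREATE TRIGGER votes_after_delete AFTER DELETE ON votes
BEGIN
    UPDATE posts SET vote_count = vote_count - OLD.value WHERE id = OLD.post_id;
END;
CREATE TRIGGER votes_after_update AFTER UPDATE OF value ON votes
BEGIN
    UPDATE posts SET vote_count = vote_count - OLD.value + NEW.value WHERE id = NEW.post_id;
END;
INSERT INTO posts (id) VALUES (1);
""")

conn.execute("INSERT INTO votes VALUES (1, 1, 1)")   # user 1 votes up
conn.execute("INSERT INTO votes VALUES (2, 1, 1)")   # user 2 votes up
conn.execute("INSERT INTO votes VALUES (3, 1, -1)")  # user 3 votes down
conn.execute("UPDATE votes SET value = 1 WHERE user_id = 3 AND post_id = 1")  # user 3 changes to up
conn.execute("DELETE FROM votes WHERE user_id = 1 AND post_id = 1")           # user 1 retracts

(cached,) = conn.execute("SELECT vote_count FROM posts WHERE id = 1").fetchone()
(summed,) = conn.execute("SELECT SUM(value) FROM votes WHERE post_id = 1").fetchone()
print(cached, summed)
```

The cached counter and the sum over the log should always agree; comparing the two is a cheap consistency check.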
If your votes are just up/down then you could make a votes table linking to the posts and having a value of 1 or -1 (up / down). This way you can sum in a single go.
https://meta.stackexchange.com/questions/1863/so-database-schema
Worth a look or
http://sqlserverpedia.com/wiki/Understanding_the_StackOverflow_Database_Schema
You will need a link table between users and the entities which are being voted on, I would have thought. This will allow you to see which users have already voted and prevent them from submitting further votes. The table can record in a boolean whether it is an up or down vote.
I would advise storing in the voted entity a current vote tally field to ease querying. The saving in size would be negligible if you omitted this.

What is the best way to store a threaded message list/tree in SQL?

I'm looking for the best way to store a set of "posts" as well as comments on those posts in SQL. Imagine a design similar to a "Wall" on Facebook where users can write posts on their wall and other users can comment on those posts. I need to be able to display all wall posts as well as the comments.
When I first started out, I came up with a table such as:
CREATE Table wallposts
(
id uuid NOT NULL,
posted timestamp NOT NULL,
userid uuid NOT NULL,
posterid uuid NOT NULL,
parentid uuid NOT NULL,
comment text NOT NULL
)
id is unique, parentid will be null on original posts and point to an id if the row is a comment on an existing post. Easy enough and super fast to insert new data. However, doing a select which would return me:
POST 1
COMMENT 1
COMMENT 2
POST 2
COMMENT 1
COMMENT 2
Regardless of which order the rows existed in the database proved to be extremely difficult. I obviously can't just order by date, as someone might comment on post 1 after post 2 has been posted. If I do a LEFT JOIN to get the parent post on all rows, and then sort by that date first, all the original posts group together as they'd have a value of null.
Then I got this idea:
CREATE TABLE wallposts
(
id uuid NOT NULL,
threadposted timestamp,
posted timestamp,
...
comment text
)
On an original post, threadposted and posted would be the same. On a comment, threadposted would be the time the original post was posted and posted would be the time the comment on that thread was posted. Now I can just do:
select * from wallposts order by threadposted, posted;
This works great, however one thing irks me. If two people create a post at the same time, comments on the two posts would get munged together as they'd have the same timestamp. I could use "ticks" instead of a datetime, but still the accuracy is only 1/1000 of a second. I could also set up a unique constraint on threadposted and posted, which makes inserts a bit more expensive, but if I had multiple database servers in a farm, the chance of a collision is still there. I almost went ahead with this anyway since the chances of this happening are extremely small, but I wanted to see if I could eat my cake and still have it too. Mostly for my own educational curiosity.
Third solution would be to store this data in the form of a graph. Each node would have a v-left and v-right pointer. I could order by "left", which would traverse the tree in the order I need. However, every time someone inserts a comment I'd have to rebalance the whole tree. This would create a ton of row locking, and all sorts of problems if the site was very busy. Plus, it's kinda extreme and also causes replication problems. So I tossed this idea quickly.
I also thought about just storing the original posts and then serializing the comments in a binary form, since who cares about individual comments. This would be very fast, however if a user wants to delete their comment or append a new comment to the end, I have to deserialize this data, modify the structure, then serialize it back and update the row. If a bunch of people are commenting on the same post at the same time, I might have random issues with that.
So here's what I eventually did. I query for all the posts ordered by date entered. In the middleware layer, I loop through the recordset and create a "stack" of original posts; each node on the stack points to a linked list of comments. When I come across an original post, I push a new node on the stack, and when I come across a comment I add a node to the linked list. I organize this in memory so I can traverse the recordset once and have O(n). After I create the in-memory representation of the wall, I traverse this data structure again and write out HTML. This works great and has super fast inserts and super fast selects, and no weird row locking issues; however, it's a bit heavier on my presentation layer and requires me to build an in-memory representation of the user's wall to move stuff around so it's in the right order. Still, I believe this is the best approach I've found so far.
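The single-pass grouping described above can be sketched in Python as follows. This version swaps the stack and linked lists for an insertion-ordered dictionary and plain lists, and the row layout (id, parentid, posted, comment) is an assumption for the example. It relies on the rows arriving sorted by posted, so a post is always seen before its comments.

```python
def build_wall(rows):
    """Group rows into threads in one pass.

    rows: iterable of (id, parentid, posted, comment) tuples sorted by posted;
    parentid is None for an original post.
    Returns a list of (post_row, [comment_rows]) in posting order.
    """
    threads = {}  # post id -> (post row, comments); dicts keep insertion order
    for row in rows:
        row_id, parent_id, _posted, _text = row
        if parent_id is None:
            threads[row_id] = (row, [])          # new original post
        else:
            threads[parent_id][1].append(row)    # comment on an earlier post
    return list(threads.values())


rows = [
    (1, None, 1, "POST 1"),
    (2, None, 2, "POST 2"),
    (3, 1, 3, "P1 COMMENT 1"),
    (4, 2, 4, "P2 COMMENT 1"),
    (5, 1, 5, "P1 COMMENT 2"),
]

# Flatten the threads into display order: each post followed by its comments.
flat = []
for post, comments in build_wall(rows):
    flat.append(post[3])
    flat.extend(comment[3] for comment in comments)
print(flat)
```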
I thought I'd check with other SQL experts to see if there's a better way to do this using some weird JOINS or UNIONS or something which would still be performant with millions of users.
I think you're better off using a simpler model with a "ParentID" on Comment to allow for nesting comments. I don't think it's usually a good practice to use datetimes as keys, especially in this case, where you don't really need to, and an identity ID will be sufficient. Here's a basic example that might work:
Post
----
ID (PK)
Timestamp
UserID (FK)
Text
Comment
-------
ID (PK)
Timestamp
PostID (FK)
ParentCommentID (FK nullable) -- allows for nested comments
Text
Do you want people to be able to comment on other comments, i.e. does the tree have infinite depth?
If you just want to have posts and then comments on those posts then you were on the right lines to start with and I believe the following SQL would meet that requirement (Untested so may be typos)
SELECT posts.id,
posts.posted AS posted_at,
posts.userid AS posted_by,
posts.posterid,
posts.comment AS post_text,
comments.posted AS commented_at,
comments.userid AS commented_by,
comments.comment AS comment_text
FROM wallposts AS posts
LEFT OUTER JOIN wallposts AS comments ON comments.parentid = posts.id
ORDER BY posts.posted, comments.posted
This technique, a self-join, simply joins the table to itself using table aliases to specify the joins.
You should look into "nested sets". They allow retrieving a hierarchy very easily with a single query.
Here's an article about them
If you are using SQL server 2008, it has built-in support for it through the "hierarchyID" type.
Inserts and updates are more costly and complicated (if you don't have the built-in support), but querying is much faster and easier.
EDIT:
Damn, missed the part where you already knew about it. (was checking from a mobile phone).
If we stick to your table design … I think you would need some special value in the parentid column to separate original posts from comments (maybe just NULL, if you change definition of that column to nullable). Then, self-join will work. Something like this:
SELECT posts.comment as [Original Post],
comments.comment as Comment
FROM wallposts AS posts
LEFT OUTER JOIN wallposts AS comments
ON posts.id=comments.parentID
WHERE posts.parentID IS NULL
ORDER BY posts.posted, comments.posted
The result set shows Original Post before every comment, and has the right order.
(This was done using SQL Server, so I'm not sure if it works in your environment.)
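For anyone who wants to try the self-join outside SQL Server, here is a small sketch using Python's built-in sqlite3 module. It assumes parentid is nullable, uses integers in place of uuid and timestamp values, and keeps only the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wallposts (
    id       INTEGER PRIMARY KEY,
    posted   INTEGER NOT NULL,  -- assumption: integer timestamp
    parentid INTEGER,           -- NULL marks an original post
    comment  TEXT NOT NULL
);
INSERT INTO wallposts VALUES
    (1, 1, NULL, 'POST 1'),
    (2, 2, NULL, 'POST 2'),
    (3, 3, 1,    'P1 COMMENT 1'),
    (4, 4, 2,    'P2 COMMENT 1'),
    (5, 5, 1,    'P1 COMMENT 2'),
    (6, 6, NULL, 'POST 3');
""")

# Each original post is paired with its comments; a post without comments gets NULL.
wall = conn.execute("""
    SELECT posts.comment    AS original_post,
           comments.comment AS comment
    FROM wallposts AS posts
    LEFT OUTER JOIN wallposts AS comments
        ON posts.id = comments.parentid
    WHERE posts.parentid IS NULL
    ORDER BY posts.posted, comments.posted
""").fetchall()
print(wall)
```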