In a site like StackOverflow should the Question and its Votes be separate tables?

I'm making a site like StackOverflow in Rails but I'm not sure if it's necessary for the Votes on a question to be stored in a separate table in the database.
Is there any good reason to separate the data?
Or could I store the Votes as a single sum in a field of the Questions table?

How would you know if a user voted on a question without keeping a votes table? Or like this website that holds you to X votes a day, how would you know how many votes a user made in the day? How would you keep track of how many up and down votes a user has done? I think good design practices pretty much scream for you to normalize the data and keep a votes table, with perhaps keeping a current +/- denormalized field in the question row for easy fetching.
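The questions raised above (has this user voted, how many votes today, how many up and down) can be sketched against a separate votes table. This is only an illustration using Python's built-in sqlite3 module; the table names, column names and helper functions are assumptions, not anything from the original site.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE votes (
    user_id     INTEGER NOT NULL,
    question_id INTEGER NOT NULL REFERENCES questions(id),
    value       INTEGER NOT NULL CHECK (value IN (-1, 1)),  -- down or up
    voted_on    TEXT    NOT NULL,                            -- ISO date
    PRIMARY KEY (user_id, question_id)   -- one vote per user per question
);
INSERT INTO questions VALUES (1, 'Separate votes table?'), (2, 'Another question');
""")
conn.executemany(
    "INSERT INTO votes VALUES (?, ?, ?, ?)",
    [(10, 1, 1, "2024-01-01"), (10, 2, -1, "2024-01-01"), (11, 1, 1, "2024-01-02")],
)

def has_voted(user_id, question_id):
    """True if this user already has a vote on this question."""
    row = conn.execute(
        "SELECT 1 FROM votes WHERE user_id = ? AND question_id = ?",
        (user_id, question_id),
    ).fetchone()
    return row is not None

def votes_cast_on(user_id, day):
    """Number of votes the user cast on a given day (for a daily limit)."""
    return conn.execute(
        "SELECT COUNT(*) FROM votes WHERE user_id = ? AND voted_on = ?",
        (user_id, day),
    ).fetchone()[0]

def up_down_counts(user_id):
    """(up votes, down votes) cast by the user; (0, 0) if none."""
    up, down = conn.execute(
        "SELECT SUM(value = 1), SUM(value = -1) FROM votes WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return (up or 0, down or 0)
```

None of these lookups is possible if only a summed count is kept on the question row.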

Yes! Think about it from an object perspective. In model driven development (objects first) you would have a container (table) of questions, and a container of votes. Of course you could simply roll them up to an aggregate form. However by doing that you lose a lot of metric detail such as who cast the vote, when, etc. It really depends on if you need the detail or not. Space is cheap so not keeping the detail is usually not a good idea. It is hard to foresee what is needed in the future!

Think about your data in multiple dimensions. There's more going on than the mere number of votes. There's:
Who cast the vote
When they cast the vote
The effect (think like a financial transaction) of the vote on any number of parties
Can you afford to discard this data? Will you ever need it? In Stackoverflow, it must be known whether I voted on something to determine if I can vote; what the vote was, so I can change it; the effect of the vote so it can be rolled back if I change it; etc.

Votes would also need to be able to be applied to both questions and answers, although both questions and answers could be stored in one table/class called Post or something similar, since they are the same data with a different title.

Like the last two answers say: keep a separate votes table.
But it would be advisable to create a view that will aggregate votes per user, per question etc. so that you don't need to do a manual query when you need that info.
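A view of that kind might look roughly like the following. It is a sketch using Python's sqlite3 module, and the votes schema and the view name question_scores are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE votes (
    user_id     INTEGER NOT NULL,
    question_id INTEGER NOT NULL,
    value       INTEGER NOT NULL CHECK (value IN (-1, 1)),
    PRIMARY KEY (user_id, question_id)
);
-- Aggregate once in a view instead of repeating the GROUP BY in every query.
CREATE VIEW question_scores AS
SELECT question_id,
       SUM(value)      AS score,
       SUM(value = 1)  AS up_votes,
       SUM(value = -1) AS down_votes
FROM votes
GROUP BY question_id;
INSERT INTO votes VALUES (10, 1, 1), (11, 1, 1), (12, 1, -1), (10, 2, -1);
""")

scores = conn.execute(
    "SELECT question_id, score, up_votes, down_votes "
    "FROM question_scores ORDER BY question_id"
).fetchall()
```

A similar view grouped by user_id would give the per-user totals.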

Yes, I would go so far as to say that it is vital to help reduce the likelihood that one person could bias the result by repeatedly voting something up or down.
It actually has very little to do with OOP, and more to do with preventing exploits.
For performance reasons you could use a static vote count in the questions table that gets updated when the vote data for a question changes. I would not, though, use a vote count by itself unless you really don't care about results being biased by particular people.
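One way to keep such a static count honest is to write the vote row and refresh the count in the same transaction. The sketch below uses Python's sqlite3 module; the schema and the cast_vote and vote_count helpers are hypothetical names, and recomputing the sum on every vote is just one possible approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (
    id         INTEGER PRIMARY KEY,
    vote_count INTEGER NOT NULL DEFAULT 0     -- cached, derived from votes
);
CREATE TABLE votes (
    user_id     INTEGER NOT NULL,
    question_id INTEGER NOT NULL REFERENCES questions(id),
    value       INTEGER NOT NULL CHECK (value IN (-1, 1)),
    PRIMARY KEY (user_id, question_id)        -- repeat votes cannot stack
);
INSERT INTO questions (id) VALUES (1);
""")

def cast_vote(user_id, question_id, value):
    """Record or change a vote and refresh the cached count atomically."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO votes (user_id, question_id, value) "
            "VALUES (?, ?, ?)",
            (user_id, question_id, value),
        )
        conn.execute(
            "UPDATE questions SET vote_count = "
            "(SELECT COALESCE(SUM(value), 0) FROM votes WHERE question_id = ?) "
            "WHERE id = ?",
            (question_id, question_id),
        )

def vote_count(question_id):
    return conn.execute(
        "SELECT vote_count FROM questions WHERE id = ?", (question_id,)
    ).fetchone()[0]

cast_vote(10, 1, 1)
cast_vote(10, 1, 1)   # same user voting again replaces the earlier vote
cast_vote(11, 1, 1)
```

Because the votes table remains the source of truth, the cached count can always be rebuilt if it is ever suspected to be wrong.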

Related

should this be two database tables or one?

I have a database table called interviews and the interviewer and the interviewee will both have to review how the interview went. The review will have similar fields (rating on a scale) but different questions.
Option 1 is to have them both in the same table and have it be 1..N back to the interview table (storing the ID of the writer and the one being reviewed as well), only limiting which fields can be input at the application level.
Option 2 is to have two tables (one specifically for interviewer reviews and one specifically for interviewee reviews).
What is your opinion of the best way to model this?
Although this is dangerously close to being opinion-based, I have a comment that is too long for comments.
Handling surveys is rather complicated. Surveys change over time because questions are added, removed, and modified and answers are added, removed, and modified. And yet, people often want to use survey questions and track the results over time.
So, the data model for a survey is much more complicated than "one table" or "two tables". There are tables for surveys, questions, answers, and the relationships and values can change over time.
One big table is often a poor choice. If you index properly and write well-tuned queries, multiple tables are going to perform fine. Having multiple tables can help you in several ways, such as:
Accessing particular data
Simpler queries
Two review tables. Those are TWO bona fide separate entities.
Here's the deal:
Designing a single table that "works" for two different purposes can be done but it's challenging: on the database level, and on your application.
But then... a few months later new requirements come in, that makes it more challenging. You'll need to implement weird logic to keep using one table. Code becomes convoluted, and testing becomes a nightmare.
And then, more changes come in. It becomes unmanageable. At some point you'll realise they were different things from the start that will EVOLVE differently.
Bottom line, it's better to keep them separate from the start to avoid huge cost in the future. Even if they have near-identical columns in the beginning.

Performance gains vs Normalizing your tables? [closed]

OK, OK, I know you're probably all going to kill me for asking this, but I got into a friendly programmer argument with a co-worker about one of our database tables, and he asked a question that I know the answer to but couldn't explain why it is the better way.
I will simplify the situation for the sake of the question. We have a fairly large table of people / users. Among other data being stored, the data in question is as follows: we have a simNumber, a cellNumber and the ipAddress of that sim.
Now I am saying that we should make a table, let's call it SimTable, put those 3 entries in it, and then put a FK in the UsersTable linking the two. Why? Because that's what I have always been taught: NORMALISE your tables!!! OK, so all is good in that regard.
But now my friend says to me: yes, but now when you want to query a user's phone number, SQL has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10000 users phone numbers, the number of operations done seriously grows in size.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance based. As much as I understand why we normalize the data (to remove redundant data, for maintainability, to make changes in one table which propagate, etc.), it does appear to me that the approach with the data in one table will be faster, or will at least need fewer tasks/operations to give me the data I want.
So what is the case in this situation? I do hope that I have not asked anything insanely silly; it is early in the morning, so do forgive me if I'm not thinking clearly.
The technology involved is MS SQL Server 2012.
[EDIT]
This article below also touches on some of the concepts I have mentioned above
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say for example two users share the same phone. If you store the phones in the user table, you'd have sim number, IP address, and cell number stored on each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
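To make the anomaly concrete, here is a small sketch of the normalized layout, written with Python's sqlite3 module rather than SQL Server so it can be run as-is. The table and column names (sims, users, sim_number and so on) are assumptions based on the question's description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sims (
    sim_number  TEXT PRIMARY KEY,
    cell_number TEXT NOT NULL,
    ip_address  TEXT NOT NULL
);
CREATE TABLE users (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    sim_number TEXT REFERENCES sims(sim_number)
);
INSERT INTO sims VALUES ('SIM-1', '555-0100', '10.0.0.1');
-- Two users share one sim; its details are stored exactly once.
INSERT INTO users VALUES (1, 'alice', 'SIM-1'), (2, 'bob', 'SIM-1');
""")

# Changing the IP address touches a single row, so the users cannot disagree.
conn.execute("UPDATE sims SET ip_address = '10.0.0.2' WHERE sim_number = 'SIM-1'")

# One set-based join returns the phone details for every user at once.
rows = conn.execute("""
    SELECT u.name, s.cell_number, s.ip_address
    FROM users AS u
    JOIN sims AS s ON s.sim_number = u.sim_number
    ORDER BY u.id
""").fetchall()
```

With the sim columns copied onto each user row instead, the same UPDATE would have to find and change every copy, and nothing in the schema would stop one from being missed.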
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
Re comments:
I agree with @JoelBrown, as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
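A rough illustration of the tag example, using Python's sqlite3 module. The schema is an assumption; the tag_list column is included only to contrast the comma-separated approach with the many-to-many table, and the function names are made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE questions (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    tag_list TEXT NOT NULL               -- denormalized copy, for contrast only
);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE questions_tagged (
    question_id INTEGER NOT NULL REFERENCES questions(id),
    tag_id      INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (question_id, tag_id)    -- no duplicate pairings
);
-- Lets "which questions have this tag?" use an index lookup.
CREATE INDEX idx_tagged_by_tag ON questions_tagged (tag_id, question_id);

INSERT INTO questions VALUES
    (1, 'Votes table?', 'sql,rails'),
    (2, 'Normalize or not?', 'sql'),
    (3, 'Rails caching', 'rails,mysql');
INSERT INTO tags VALUES (1, 'sql'), (2, 'rails'), (3, 'mysql');
INSERT INTO questions_tagged VALUES (1, 1), (1, 2), (2, 1), (3, 2), (3, 3);
""")

def questions_for_tag(name):
    """Normalized lookup: an equality match on the pairing table."""
    rows = conn.execute("""
        SELECT qt.question_id
        FROM questions_tagged AS qt
        JOIN tags AS t ON t.id = qt.tag_id
        WHERE t.name = ?
        ORDER BY qt.question_id
    """, (name,))
    return [r[0] for r in rows]

def questions_for_tag_csv(name):
    """Denormalized lookup: a naive substring search over the tag string."""
    rows = conn.execute(
        "SELECT id FROM questions WHERE tag_list LIKE ? ORDER BY id",
        (f"%{name}%",),
    )
    return [r[0] for r in rows]
```

Searching for "sql" through the pairing table returns questions 1 and 2, while the naive substring search also returns question 3 because "mysql" contains "sql". Avoiding that needs more careful pattern matching on top of the scan.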
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover these.
There are a couple of considerations here:
1. Does a user always have one and only one phone number?
If so, then it is still normalised to add these to the user table. However, if the user can have either no phone number or multiple phone numbers, then the phone details should be held in a separate table.
Assuming you have these in separate tables, but after conducting performance tests you found that joining on these 2 tables was having a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into cache and denormalized doesn't, you may actually observe a performance decrease for the latter.
And you won't be able to spot that in development, unless you actually have the amount of data that is similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

Implementing Review flags in Databases; best practices

I need to store some review flags that relate to some entities. Each review flag can only relate to a single entity property group. For example, table Parents has a ParentsStatus flag and table Children has a set of ChildrenStatus flags.
In the current design proposal I have three tables:
ReviewTypes: stores the flags and the properties they relate to.
ReviewPositions: stores the values the flags can have.
Reviews: stores the transaction data, the actual reviews. It is like UsersToFlags: Flags in a database rows, best practices.
The problem is I am getting push back that there is no need to have the Reviews table and it would be better to just store this actual review data on each entity. For example, add an extra column to Parents to hold ParentsStatus. They feel it is a simpler solution and separating the data out is just “overkill” for our scenario.
I don’t like this idea as this means that every time we want to add a new review flag we need to update the core entity table to hold that flag.
Space is not a problem.
Do people have any strong opinions?
Edit:
This comment applies to the three answers. The consensus is the relational approach is best but I think I need to read up a little more on the EAV model as from some very basic reading Best beginner resources for understanding the EAV database model? and its related links it does not appear to be super straightforward and I don't want to dig myself a hole. Thanks to wildplasser. I'll loop back once I read up a bit more.
Oh yes. Their idea is simpler, until you want to enhance it. Given the scheme they are proposing, what if two reviews were needed per entity? What if you wanted to attach other things such as notes/annotations? Once they find out how much of an inflatable dartboard their idea is, what do you have to do to move to a more useful one? Not to mention you need some way of identifying status fields, with fragile rubbish like column name ends with "_Status", or you have to hard code them somewhere.
Doing it properly is not that much more work and it's not more complex; in fact in many ways it's simpler, and it will cope with the inevitable changes at far less cost.
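For what it's worth, the three-table design from the question might be sketched as below, using Python's sqlite3 module. Every column name here is an assumption, and ParentsAddressCheck is a made-up second flag used only to show that adding a flag is an INSERT rather than a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parents (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE review_types (          -- the flags and what they relate to
    id     INTEGER PRIMARY KEY,
    entity TEXT NOT NULL,
    name   TEXT NOT NULL UNIQUE
);
CREATE TABLE review_positions (      -- the values a flag can take
    id             INTEGER PRIMARY KEY,
    review_type_id INTEGER NOT NULL REFERENCES review_types(id),
    label          TEXT NOT NULL
);
CREATE TABLE reviews (               -- the actual review transactions
    id                 INTEGER PRIMARY KEY,
    review_type_id     INTEGER NOT NULL REFERENCES review_types(id),
    entity_id          INTEGER NOT NULL,
    review_position_id INTEGER NOT NULL REFERENCES review_positions(id),
    reviewed_at        TEXT NOT NULL
);
INSERT INTO parents VALUES (1, 'Parent A');
INSERT INTO review_types VALUES (1, 'Parents', 'ParentsStatus');
INSERT INTO review_positions VALUES (1, 1, 'Pending'), (2, 1, 'Approved');
INSERT INTO reviews VALUES (1, 1, 1, 1, '2024-01-01'), (2, 1, 1, 2, '2024-01-05');
""")

def table_columns(table):
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

parent_columns_before = table_columns("parents")
# Adding a new flag is data, not DDL: the core entity table is untouched.
conn.execute("INSERT INTO review_types VALUES (2, 'Parents', 'ParentsAddressCheck')")
parent_columns_after = table_columns("parents")

def current_status(entity_id, flag_name):
    """Label of the most recent review of this flag, or None if never reviewed."""
    row = conn.execute("""
        SELECT p.label
        FROM reviews AS r
        JOIN review_types AS t ON t.id = r.review_type_id
        JOIN review_positions AS p ON p.id = r.review_position_id
        WHERE r.entity_id = ? AND t.name = ?
        ORDER BY r.reviewed_at DESC, r.id DESC
        LIMIT 1
    """, (entity_id, flag_name)).fetchone()
    return row[0] if row else None
```

The reviews table also keeps the history (Pending, then Approved), which a single status column on Parents would overwrite.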
Normalization is always preferable to premature optimization.
One reason why I like the reviews table separate is that you can hold changes you may not want to display yet (as they haven't been reviewed and approved) and still maintain the old data until the new data is approved. I don't know if your situation requires that.
To make future programming simpler for when you want to display the changes, you can write a view that shows the old and new data.

What's the best way to store/calculate user scores?

I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot of querying to get the total score for each user which I would like to display next to their username (in the form of a level). For example, a query to the submitted items table to get the scores for each item from that user, a query to the comments table etc. If all this needs to be done for every user mentioned on a page.... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even stack overflow does similar) so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!
I think that this is definitely a great question. I've had to build systems that have similar behavior to this--especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount
-Guid (PK)
-FirstName
-LastName
-EmailAddress

UserAchievement
-Guid (PK)
-UserAccountGuid (FK)
-Name
-Score
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
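SELECT [UserAccount].[Guid] AS UserAccountGuid,
       [UserAccount].[FirstName] AS FirstName,
       [UserAccount].[LastName] AS LastName,
       SUM([UserAchievement].[Score]) AS TotalPoints
FROM [UserAccount]
INNER JOIN [UserAchievement]
    ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
-- Group by the key as well, so two users with the same name are not merged.
GROUP BY [UserAccount].[Guid],
         [UserAccount].[FirstName],
         [UserAccount].[LastName]
-- Apply ORDER BY [LastName] when selecting from the view; SQL Server does not
-- allow ORDER BY inside a view definition unless TOP is also specified.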
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
I will add one more point--if your UserAccount table gets huge, you may consider a slightly more intelligent query that would incorporate the names of the accounts you need to get roll-ups for. This will make it possible not to return huge data sets to your web site when you're only showing, you know, 3-10 users' information on the page. I'd have to think a bit more about how to do this elegantly, but I'd suggest staying away from "IN" statements since this will invoke a linear search of the table.
For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync, so the updates to the score total are a synchronous aspect of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose if you want to have some kind of delayed update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
In the past we've always used some sort of nightly or periodic cron job to calculate the current score and save it in the database, sort of like a persistent view of the SUM on the activities table. Like most "best practices", these are simply guidelines, and it's often better and more practical to deviate from a specific hard-nosed practice in very specific areas.
Plus it's not really all that much of a deviation if you use the cron job as it's better viewed as a cache stored in the database.
If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the site's code.
The user scores would be updated continuously, and could be quickly queried for display.
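The trigger variant could look something like this sketch, written with Python's sqlite3 module. The user_scores and achievements tables and the trigger name are assumptions, and only inserts are handled; edits or deletions of achievements would need their own triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_scores (
    user_id INTEGER PRIMARY KEY,
    total   INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE achievements (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    name    TEXT    NOT NULL,
    score   INTEGER NOT NULL
);
-- Keep the running total in step with every new achievement row.
CREATE TRIGGER achievements_after_insert AFTER INSERT ON achievements
BEGIN
    INSERT OR IGNORE INTO user_scores (user_id, total) VALUES (NEW.user_id, 0);
    UPDATE user_scores SET total = total + NEW.score WHERE user_id = NEW.user_id;
END;
""")

insert_sql = "INSERT INTO achievements (user_id, name, score) VALUES (?, ?, ?)"
conn.execute(insert_sql, (1, "submitted item", 25))
conn.executemany(insert_sql, [(1, "comment", 1)] * 30)
conn.execute(insert_sql, (1, "bonus for being awesome", 10))

# Displaying a score is now a single primary-key lookup.
cached_total = conn.execute(
    "SELECT total FROM user_scores WHERE user_id = 1"
).fetchone()[0]
# The detail rows remain the source of truth and can rebuild the total.
recomputed_total = conn.execute(
    "SELECT SUM(score) FROM achievements WHERE user_id = 1"
).fetchone()[0]
```

The numbers mirror the question's example: 25 points for an item, 30 comments at 1 point each, and a 10 point bonus.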

Generate webpages directly from database or cache?

[I'm not asking about the architecture of SO, but it would be helpful to the question.]
On SO, when a user clicks on his/her name and clicks on "responses" they see other users responses to comment threads, questions, and answers in which they have participated. I've had the sneaking suspicion that I've missed certain responses out there, which made me wonder: if you had to build that thing, would you pull everything dynamically from the database every time a user requested it? Or would you modify it when there is new related activity in the application? Or would you build it in a nightly daemon process?
I imagine that the real answer is that it's dynamically constructed every time, but that the tables are denormalized in such a way so as to make the thing less time-consuming. How would you build it?
I'm asking about any platform, of course, not only on .Net.
I would pull it dynamically from the database every time. I think this gives you the best result from a user experience and then I would apply the principle that premature optimization is evil. Later if there were performance issues I would look into caching.
I think doing it as a daemon/push process would actually result in more overall work being done. That is the updates would happen more frequently than the users are requesting the info.
Obviously, when an answer or comment is posted, you'll want to identify the user that should be informed in their responses tab. Then just add a row to a responses table containing the response text, timestamp, and the user to which it belongs. That way you can dynamically generate the tab with a simple
select * from responses where user=<userid> order by time desc limit 30
or something like that.
p.s. Extra credit to anyone that can write a query that will remove old responses - assume that each person should have the last 30 responses in their responses tab.
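One possible take on the extra-credit query, sketched with Python's sqlite3 module. The responses schema (including the id and created_at columns) is an assumption, and the correlated subquery with LIMIT is written for SQLite; other databases may need a window function or a different formulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE responses (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    body       TEXT    NOT NULL,
    created_at INTEGER NOT NULL          -- e.g. a Unix timestamp
);
CREATE INDEX idx_responses_user_time ON responses (user_id, created_at);
""")
# User 1 has 40 responses, user 2 has only 5.
rows = [(1, f"reply {i}", i) for i in range(40)]
rows += [(2, f"reply {i}", i) for i in range(5)]
conn.executemany(
    "INSERT INTO responses (user_id, body, created_at) VALUES (?, ?, ?)", rows
)

KEEP = 30
# Delete every row that is not among its owner's KEEP most recent responses.
conn.execute("""
    DELETE FROM responses
    WHERE id NOT IN (
        SELECT newer.id
        FROM responses AS newer
        WHERE newer.user_id = responses.user_id
        ORDER BY newer.created_at DESC, newer.id DESC
        LIMIT ?
    )
""", (KEEP,))

counts = dict(conn.execute(
    "SELECT user_id, COUNT(*) FROM responses GROUP BY user_id"
))
oldest_kept = conn.execute(
    "SELECT MIN(created_at) FROM responses WHERE user_id = 1"
).fetchone()[0]
```

Users with fewer than 30 responses are left untouched, and a job like this could run periodically rather than on every insert.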
I expect that userid would be a natural option for the clustered index. If you have an "Active" boolean field then you don't need to worry much about locks; the table could be write-only except to update the (unindexed) Active column. I bet it already works that way, since it appears that everything is recoverable.
Don't need no stinking extra-credit response remover.
I would assume this is denormalized in the database. The Comment table probably has both an answer_id and an answer_uid, so the SQL to find comments on your answers just runs against the Comment table. The same setup would work on the Answer table: each answer has a question_id and a question_uid.
Having said that, these are probably the same table and you have response_to_id and response_to_uid and that makes lots of code simpler and makes the "recent" tab a single select as well. In fact the difference between the two selects is one uses the uid and one uses the response_to_uid.
I'd say that your UI and your database should both be driven by your Application Domain; so they will reflect each other based on their common provenance there.
Some quick notes to illustrate, using simplified Object Role Modeling as discussed by Fowler et al.
Entities
Users
Questions
Answers
Comments
Entity Roles
(Note: In Object Role Modeling, most Roles are reflexive. Some, e.g. booleans here, are monopolar)
Question has User
Question has QuestionVersions
Question has Answers
Question has Comments
Answer has AnswerVersions
Answer has Comments
Question has User
QuestionVersion has Text
QuestionVersion has Timestamp
QuestionVersion has IsDeleted (could be inferred from nonNULL timestamp eg)
QuestionVersion has DeletedByUser
QuestionVersion has DeletedTimestamp
Answer has User
AnswerVersion has Text
AnswerVersion has Timestamp
AnswerVersion has IsDeleted
AnswerVersion has DeletedByUser
AnswerVersion has DeletedTimestamp
Comment has Text
Comment has User
Comment has Timestamp
Comment has IsDeleted (boolean)
(note - no versions on comments)
I think that's the basics. These assertions drive ERDs in ORM. Hopefully it's self-evident how they drive the User Stories as well.
I don't think an implementation of a normalized design like this would require denormalization - especially since I think it's clear (from behavior) that queries => UI displays are cached to be refreshed 1X per minute.