Does Stack Exchange's database schema follow good practice? [closed] - sql

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
This is somewhat of a meta question, but because it relates to a database design, I thought I should post it here.
I'm building a site that includes Q+A and was wondering how I should structure my SQL database, so naturally, I looked to the best of the best. However, the Stack Exchange database schema seems to defy what I've learned about creating maintainable/extensible table hierarchies.
As you can see, Stack Exchange stores all of its "Posts" in one table, except for comments, which have their own table. Post types include questions, answers, and various wiki things. This results in a lot of NULL columns in the table. For example, questions have titles, tags, and answerCounts, while answers don't, so all answer entries have NULL for all three of those columns. If more post types are added over time, this will become progressively less maintainable. And the fact that comments are the only type of post with their own table just seems inconsistent.
What I've read states that it's generally preferred to use an object subclass hierarchy, in which there's a generic "Posts" table along with a bunch of tables for each type of post that all have one column that maps back to the corresponding entry in the "Posts" table. This keeps the number of null columns to a minimum and makes it more extensible, but slows down queries because they'll require more joins.
So why does Stack Exchange use this giant table method? Is it just the result of ages of modifications to an old database? More specifically, should I use this model for my own Q+A system or stick with an object subclass hierarchy (my Q+A/forum system will closely resemble SO's, with several types of posts including questions, answers, polls, reviews, etc.)?

This is a classic case of so-called "Object-relational impedance mismatch". Specifically, you are talking about mapping OO's inheritance into a relational database structure. There are several common ways of doing that:
A table per subclass,
A table per leaf subclass, and
A table per class hierarchy (with a discriminator)
Each of these strategies is perfectly valid. Moreover, the structures could be mixed as needed.
It looks like Stack Exchange used a table per class hierarchy approach, with PostTypeId serving as a discriminator. This approach is as valid as any other approach that they could have taken. It is also one of the simplest ones to take from the maintenance standpoint, because it lets you construct manual queries with less work.
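To make the table-per-class-hierarchy idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The column list is heavily simplified, and the discriminator values (1 for a question, 2 for an answer) are assumptions for illustration rather than a statement about the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (
    Id         INTEGER PRIMARY KEY,
    PostTypeId INTEGER NOT NULL,             -- discriminator (assumed: 1 = question, 2 = answer)
    ParentId   INTEGER REFERENCES Posts(Id), -- only set for answers
    Title      TEXT,                         -- only set for questions, NULL otherwise
    Body       TEXT NOT NULL
);
INSERT INTO Posts (Id, PostTypeId, ParentId, Title, Body) VALUES
    (1, 1, NULL, 'How do I model posts?', 'Question body'),
    (2, 2, 1,    NULL,                    'First answer'),
    (3, 2, 1,    NULL,                    'Second answer');
""")

# Picking one "subclass" is a filter on the discriminator; no joins are needed.
questions = conn.execute(
    "SELECT Id, Title FROM Posts WHERE PostTypeId = 1"
).fetchall()

# Answers to a question live in the same table and point back via ParentId.
answers = conn.execute(
    "SELECT Id, Body FROM Posts WHERE PostTypeId = 2 AND ParentId = ? ORDER BY Id",
    (1,),
).fetchall()

print(questions)
print(answers)
```

The price is visible in the INSERT: every answer row carries a NULL Title, which is exactly the trade the question complains about.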
There is another thing in the structure of the table that you did not mention: it is not normalized. Specifically, there are AnswerCount and CommentCount fields that store information that could be obtained by aggregating the table (i.e. running a SELECT COUNT(*) FROM ... WHERE ... AND other.ParentId = p.Id ...). This is a common tradeoff between normalization and speed of execution: most likely, profiling indicated that the aggregation takes a significant amount of time, so the counts have been moved into the "parent" record.
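A rough sketch of that tradeoff, again with sqlite3: the cached AnswerCount is cheap to read, but the application has to keep it in step with the rows it summarizes. The helper add_answer and the simplified columns are hypothetical; how Stack Exchange actually maintains its counters is not stated here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Posts (
    Id          INTEGER PRIMARY KEY,
    PostTypeId  INTEGER NOT NULL,  -- assumed: 1 = question, 2 = answer
    ParentId    INTEGER,
    AnswerCount INTEGER            -- denormalized counter, used on questions only
);
INSERT INTO Posts VALUES (1, 1, NULL, 0);
""")

def add_answer(conn, answer_id, question_id):
    """Insert an answer and bump the parent's cached counter in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO Posts VALUES (?, 2, ?, NULL)", (answer_id, question_id)
        )
        conn.execute(
            "UPDATE Posts SET AnswerCount = AnswerCount + 1 WHERE Id = ?",
            (question_id,),
        )

add_answer(conn, 2, 1)
add_answer(conn, 3, 1)

# Normalized route: aggregate the child rows every time.
computed = conn.execute(
    "SELECT COUNT(*) FROM Posts WHERE PostTypeId = 2 AND ParentId = ?", (1,)
).fetchone()[0]

# Denormalized route: read the stored counter from the parent row.
cached = conn.execute(
    "SELECT AnswerCount FROM Posts WHERE Id = ?", (1,)
).fetchone()[0]

print(computed, cached)
```

Both values agree only as long as every write path goes through something like add_answer, which is the integrity cost of the shortcut.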

Related

Performance gains vs Normalizing your tables? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Ok ok, I know you are probably all going to kill me for asking this. However, I got into a friendly programmer argument with a co-worker about one of our database tables, and he asked a question which I know the answer to, but I couldn't explain why it is the better way.
I will simplify the situation for the sake of the question. We have a fairly large table of people / users. Amongst other data being stored, the data in question is as follows: we have a simNumber, a cellNumber and the ipAddress of that sim.
Now I am saying that we should make a table, let's call it SimTable, put those 3 entries in the sim table, and then put a FK in the UsersTable linking the two. Why? Because that's what I have always been taught: NORMALISE your tables!!! Ok, so all is good in that regard.
But now my friend says to me yes, but now when you want to query a users phone number, SQL now has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10000 users phone numbers, the number of operations done seriously grows in size.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance based. As much as I understand why we normalize the data (to remove redundant data, for maintainability, to make changes to data in one table which propagate up, etc.), it does appear to me that the approach with the data in one table will be faster, or will at least need fewer tasks/operations to give me the data I want?
So what is the case in this situation? I do hope that I have not asked anything insanely silly. It is early in the morning, so do forgive me if I'm not thinking clearly.
The technology involved is MS SQL Server 2012.
[EDIT]
The article below also touches on some of the concepts I have mentioned above:
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say for example two users share the same phone. If you store the phones in the user table, you'd have the sim number, IP address, and cell number stored on each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
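Here is a small sketch of that anomaly using Python's sqlite3 module. The table layout and sample values are made up for illustration; only simNumber, cellNumber and ipAddress come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Denormalized: the SIM details are repeated on every user row.
CREATE TABLE Users (
    userId     INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    simNumber  TEXT,
    cellNumber TEXT,
    ipAddress  TEXT
);
INSERT INTO Users VALUES
    (1, 'alice', 'SIM-1', '555-0100', '10.0.0.1'),
    (2, 'bob',   'SIM-1', '555-0100', '10.0.0.1');
""")

# The IP address is changed on one row but not on the other.
conn.execute("UPDATE Users SET ipAddress = '10.0.0.2' WHERE userId = 1")

# Detecting the damage now needs an extra query that someone has to
# remember to write and run: SIMs that appear with more than one IP.
conflicts = conn.execute("""
    SELECT simNumber, COUNT(DISTINCT ipAddress)
    FROM Users
    GROUP BY simNumber
    HAVING COUNT(DISTINCT ipAddress) > 1
""").fetchall()

print(conflicts)
```

With a separate SIM table keyed on simNumber there is only one row to update, so this state cannot arise in the first place.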
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
Re comments:
I agree with @JoelBrown, as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
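The two designs can be compared side by side in a short sqlite3 sketch. The tag_list column, the sample rows and the index name are assumptions added for illustration; both designs are put in one database only to make the comparison easy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Questions (id INTEGER PRIMARY KEY, title TEXT, tag_list TEXT);
CREATE TABLE Tags (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE QuestionsTagged (
    question_id INTEGER REFERENCES Questions(id),
    tag_id      INTEGER REFERENCES Tags(id),
    PRIMARY KEY (question_id, tag_id)
);
-- Lets "all questions for a tag" be answered from an index.
CREATE INDEX idx_tagged_by_tag ON QuestionsTagged (tag_id, question_id);

INSERT INTO Questions VALUES
    (1, 'Join order?',  'sql,mysql'),
    (2, 'NoSQL vs SQL', 'nosql'),
    (3, 'Index types',  'sql');
INSERT INTO Tags VALUES (1, 'sql'), (2, 'mysql'), (3, 'nosql');
INSERT INTO QuestionsTagged VALUES (1, 1), (1, 2), (2, 3), (3, 1);
""")

# Normalized: an equality lookup on the many-to-many table.
normalized = [row[0] for row in conn.execute("""
    SELECT qt.question_id
    FROM QuestionsTagged qt JOIN Tags t ON t.id = qt.tag_id
    WHERE t.name = ?
    ORDER BY qt.question_id
""", ("sql",))]

# Denormalized, naive: a substring search also matches 'mysql' and 'nosql'.
naive = [row[0] for row in conn.execute(
    "SELECT id FROM Questions WHERE tag_list LIKE '%' || ? || '%' ORDER BY id",
    ("sql",),
)]

# Denormalized, corrected: wrap both sides in commas. It is right, but it is
# still a pattern with a leading wildcard, so every row has to be examined.
delimited = [row[0] for row in conn.execute(
    "SELECT id FROM Questions "
    "WHERE ',' || tag_list || ',' LIKE '%,' || ? || ',%' ORDER BY id",
    ("sql",),
)]

print(normalized, naive, delimited)
```

The naive search returns question 2 as a false positive, and the corrected one only works because of string tricks that no foreign key or uniqueness constraint can check.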
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover these.
There are a couple of considerations here:
1. Does a user always have one and only one phone number?
If so, then it is still normalised to add these to the user table. However, if the user can have either no phone number or multiple phone numbers, then the phone details should be held in a separate table.
Assuming you have these in separate tables, but after conducting performance tests you found that joining these 2 tables was having a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into cache and denormalized doesn't, you may actually observe a performance decrease for the latter.
And you won't be able to spot that in development, unless you actually have the amount of data that is similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

Database Design For Users\Downloads [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I need to design a database for something like a downloads site. I want to keep track of users and the programs each user downloaded, and also allow users to rate and comment on said programs. The things I need from this database: get the average rating for a program, get all comments for a program, and know exactly what program was downloaded by whom (I don't care how many times each program was downloaded, but I want to know for each user what programs he has downloaded). Maybe also count the number of comments for each program, and that's about it (it's a very small project for personal use that I want to keep simple).
I come up with these entities -
User(uid,uname etc)
Program(pid,pname)
And the following relationships-
UserDownloadedProgram(uid,pid,timestamp)
UserCommentedOnProgram(uid,pid,commentText,timestamp)
UserRatedProgram(uid,pid,rating)
Why I chose it this way: the relationships (user downloads, user comments and rates) are many-to-many. A user downloads many programs and a program is downloaded by many users. The same goes for comments and ratings (a user comments on many programs and a program is commented on or rated by many users). The best practice as far as I know is to create a third table (a relationship table), which sits on the "many" side of a one-to-many relationship with each entity.
I suppose that in this design the average rating and comment retrieval are done by join queries or something similar.
I'm a total noob in database design but I try to adhere to best practices. Is this design more or less ok, or am I overlooking something?
I can definitely think of other possibilities: maybe comment and/or rating can be an entity (table) by itself and the relationships are between 3 entities. I'm not really sure what the benefits/drawbacks of that are. I know that I don't really care about the comments or the ratings; I only want to display them where appropriate and maintain them (delete when needed). So how do I know whether they should become entities themselves?
Any thoughts?
You would create new entities as dictated by the rules of normalization. There is no particular reason to make an additional (separate) table for comments because you already have one. Who made the comment and which program the comment applied to are full-fledged attributes of a comment. The foreign keys representing these relationships (which are many-to-one, from the perspective of the comment table) belong right where you've put them.
The tables you've proposed are in third normal form which is acceptable according to best practices. I would add that you seem to be tracking data on a transactional basis (i.e. recording events as and when they occur). That is a good practice too because you can always figure out whatever you want to based on detailed information.
Calculating number of downloads or number of comments is a simple matter of using SQL Aggregate Functions with filters on the foreign key(s) that apply to your query - e.g. where pid=1234 etc.
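As a sketch of what those aggregate queries could look like, here is a small sqlite3 example against the relationship tables proposed in the question. The column types, the composite primary key and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE UserRatedProgram (
    uid INTEGER, pid INTEGER, rating INTEGER,
    PRIMARY KEY (uid, pid)          -- assumed: one rating per user per program
);
CREATE TABLE UserCommentedOnProgram (
    uid INTEGER, pid INTEGER, commentText TEXT, timestamp TEXT
);
INSERT INTO UserRatedProgram VALUES (1, 1234, 4), (2, 1234, 5), (1, 99, 2);
INSERT INTO UserCommentedOnProgram VALUES
    (1, 1234, 'Works well',       '2015-01-01'),
    (2, 1234, 'Crashes on start', '2015-01-02'),
    (2, 99,   'Meh',              '2015-01-03');
""")

# Average rating for one program: an aggregate filtered on the foreign key.
avg_rating = conn.execute(
    "SELECT AVG(rating) FROM UserRatedProgram WHERE pid = ?", (1234,)
).fetchone()[0]

# Number of comments for the same program.
comment_count = conn.execute(
    "SELECT COUNT(*) FROM UserCommentedOnProgram WHERE pid = ?", (1234,)
).fetchone()[0]

print(avg_rating, comment_count)
```

Neither query needs a join, because the foreign key pid already sits on the rows being aggregated.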
I would make Downloads an entity with its own id. You could have a download status, you may have multiple downloads of the same program for one user, and you may need to associate a download with an order or something else.

PostgreSQL + Django: SQL design pattern for multi-type data [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I need to store message data in SQL. Cannot decide which way to go here.
There is a main class Message, say (simplified):
class Message(models.Model):
    user_id = models.ForeignKey(User)
    text = models.TextField()
Plus, there are other Message classes that inherit this one.
class MmsMessage(Message):
    imagedata = models.ForeignKey(ImageData)
And so on. These other message classes of course can have more than 1 additional field.
Now, I am evaluating the best (fastest) design pattern to make this work.
In around 25% of cases I will not be needing additional fields, simply raw Message objects (Message.objects.all). In other cases, I need all data. Additional fields may not necessarily be searchable. Nonetheless, it would be nice thing to have.
I was thinking about:
A: Inheritance (concrete, abstract)
Abstract inheritance is out. I lose the ability to do Message.objects.all(), which is unacceptable.
Concrete inheritance seems to me like a way to go. Tried two approaches. django-model-utils one (select_subclasses) which doesn't need additional queries, but due to lots of inner joins and redundant data in results it is very slow compared to other solutions.
django_polymorphic (still concrete inheritance) approach (using contenttypes to know what we are dealing with and then select related fields) is ~4 times faster than select_subclasses (at least on postgresql) - which was a small surprise for me (it requires +n queries where n is a number of child types but still faster due to simpler joins and no unnecessary data results). Tested on 10 000 objects across 20 different Message child types.
B: EAV model (many to many for additional attributes)
Haven't tested EAV model but I doubt it will be faster than inheritance solution. When I know what column names and types I want, it seems that EAV model loses all its charm.
[UPDATED - horse_with_no_name] B1: hstore - similar to EAV with many to many but possibly much faster (no joins, backend support)
Great to add dictionary like custom fields.
Downsides: I lose compatibility with other django database backends (I would prefer not to), also it is type-agnostic, key and value is TEXT. I am also worried about making Message table raw queries slower in general due to many TEXT fields in hstore dict.
C: XML field in Message table for additional data
XML field in Message table is something that feels a little fishy to me. What if I don't need these additional fields (from message child types) to be searchable or indexable - is an XML field a good solution?
What is the best option in your opinion?
The simple answer here is to just use a single table. However the real question is why you're putting stuff in a database in the first place. If your intent is to scale up to large sizes, then you probably want to look at a hybrid storage model (indexing messages in the DB and storing the raw message in something like hbase).

What mysql database tables and relationships would support a Q&A survey with conditional questions? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
I'm working on a fairly simple survey system right now. The database schema is going to be simple: a Survey table, in a one-to-many relation with Question table, which is in a one-to-many relation with the Answer table and with the PossibleAnswers table.
Recently the customer realised she wants the ability to show certain questions only to people who gave one particular answer to some previous question (eg. Do you buy cigarettes? would be followed by What's your favourite cigarette brand?, there's no point of asking the second question to a non-smoker).
Now I started to wonder what would be the best way to implement these conditional questions in terms of my database schema. Say question A has 2 possible answers, A and B, and question B should only appear to a user if the answer was A?
Edit: What I'm looking for is a way to store that information about requirements in a database. The handling of the data will probably be done on the application side, as my SQL skills suck ;)
Survey Database Design
Last Update: 5/3/2015
Diagram and SQL files now available at https://github.com/durrantm/survey
If you use this (top) answer or any element, please add feedback on improvements !!!
This is a real classic, done by thousands. They always seem 'fairly simple' to start with, but to be good it's actually pretty complex. To do this in Rails I would use the model shown in the attached diagram. I'm sure it seems way overcomplicated for some, but once you've built a few of these over the years, you realize that most of the design decisions are very classic patterns, best addressed by a dynamic, flexible data structure at the outset.
More details below:
Table details for key tables
answers
The answers table is critical as it captures the actual responses by users.
You'll notice that answers links to question_options, not questions. This is intentional.
input_types
input_types are the types of questions. Each question can only be of 1 type, e.g. all radio dials, all text field(s), etc. Use additional questions for when there are (say) 5 radio-dials and 1 check box for an "include?" option or some such combination. Label the two questions in the users view as one but internally have two questions, one for the radio-dials, one for the check box. The checkbox will have a group of 1 in this case.
option_groups
option_groups and option_choices let you build 'common' groups.
One example, in a real estate application there might be the question 'How old is the property?'.
The answers might be desired in the ranges:
1-5
6-10
10-25
25-100
100+
Then, for example, if there is a question about the adjoining property age, then the survey will want to 'reuse' the above ranges, so that same option_group and options get used.
units_of_measure
units_of_measure is as it sounds. Whether it's inches, cups, pixels, bricks or whatever, you can define it once here.
FYI: Although generic in nature, one can create an application on top of this, and this schema is well-suited to the Ruby On Rails framework with conventions such as "id" for the primary key for each table. Also the relationships are all simple one_to_many's with no many_to_many or has_many throughs needed. I would probably add has_many :throughs and/or :delegates though to get things like survey_name from an individual answer easily without.multiple.chaining.
You could also think about complex rules, and have a string based condition field in your Questions table, accepting/parsing any of these:
A(1)=3
( (A(1)=3) and (A(2)=4) )
A(3)>2
(A(3)=1) and (A(17)!=2) and C(1)
Where A(x)=y means "Answer of question x is y" and C(x) means the condition of question x (default is true)...
The questions have an order field, and you would go through them one-by one, skipping questions where the condition is FALSE.
This should allow surveys of any complexity you want. Your GUI could automatically create these in "Simple mode" and allow for an "Advanced mode" where a user can enter the equations directly.
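One possible way to evaluate such condition strings on the application side is a small hand-written parser, sketched below in Python. Several things are assumptions here: answers are stored as integers keyed by question number, C(x) is looked up in a dict of already-evaluated conditions and defaults to true, and the operators beyond the ones shown above (<, <=, >=) are added for convenience. The names tokenize, Parser and evaluate are hypothetical.

```python
import operator
import re

TOKEN = re.compile(r"\s*(A|C|and|or|!=|>=|<=|=|>|<|\(|\)|\d+)")
OPS = {"=": operator.eq, "!=": operator.ne, ">": operator.gt,
       "<": operator.lt, ">=": operator.ge, "<=": operator.le}

def tokenize(text):
    """Split a condition string into tokens, rejecting anything unknown."""
    text, tokens, pos = text.strip(), [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise ValueError(f"unexpected input: {text[pos:]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    """Recursive descent: expr := term ('or' term)*, term := factor ('and' factor)*."""

    def __init__(self, tokens, answers, conditions):
        self.tokens, self.pos = tokens, 0
        self.answers, self.conditions = answers, conditions

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected!r}, got {token!r}")
        self.pos += 1
        return token

    def index(self):
        # Parses '(' number ')' and returns the question number.
        self.take("(")
        number = int(self.take())
        self.take(")")
        return number

    def expr(self):
        value = self.term()
        while self.peek() == "or":
            self.take()
            rhs = self.term()  # always parse the right side to consume its tokens
            value = value or rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() == "and":
            self.take()
            rhs = self.factor()
            value = value and rhs
        return value

    def factor(self):
        token = self.take()
        if token == "(":
            value = self.expr()
            self.take(")")
            return value
        if token == "C":
            return self.conditions.get(self.index(), True)  # default is true
        if token == "A":
            answer = self.answers.get(self.index())
            op_token = self.take()
            if op_token not in OPS:
                raise ValueError(f"expected a comparison, got {op_token!r}")
            expected = int(self.take())
            # An unanswered question never satisfies a comparison.
            return answer is not None and OPS[op_token](answer, expected)
        raise ValueError(f"unexpected token {token!r}")

def evaluate(condition, answers, conditions=None):
    parser = Parser(tokenize(condition), answers, conditions or {})
    result = parser.expr()
    if parser.peek() is not None:
        raise ValueError(f"trailing input at token {parser.peek()!r}")
    return bool(result)

answers = {1: 3, 2: 4, 3: 1, 17: 5}
print(evaluate("( (A(1)=3) and (A(2)=4) )", answers))
print(evaluate("A(3)>2", answers))
```

Parsing the string instead of passing it to eval keeps user-entered "Advanced mode" conditions from running arbitrary code.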
one way is to add a table 'question requirements' with fields:
question_id (link to the "which brand?" question)
required_question_id (link to the "do you smoke?" question)
required_answer_id (link to the "yes" answer)
In the application you check this table before you pose a certain question.
With a separate table, it's easy to add more required answers (adding another row for the "sometimes" answer, etc.).
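A minimal sketch of that check, using sqlite3 for storage and plain Python for the logic. The ids (question 2 depends on question 1, answers 10 and 12) and the helper should_ask are made up for illustration. It also assumes that several rows for the same required question mean "any of these answers", while rows for different required questions must all be satisfied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE question_requirements (
    question_id          INTEGER NOT NULL,
    required_question_id INTEGER NOT NULL,
    required_answer_id   INTEGER NOT NULL
);
-- Question 2 ("which brand?") is shown only if question 1 ("do you smoke?")
-- was answered with option 10 ("yes") or option 12 ("sometimes").
INSERT INTO question_requirements VALUES (2, 1, 10), (2, 1, 12);
""")

def should_ask(conn, question_id, given_answers):
    """Return True if the question may be posed to this respondent.

    given_answers maps question id -> chosen answer id.
    """
    rows = conn.execute(
        "SELECT required_question_id, required_answer_id "
        "FROM question_requirements WHERE question_id = ?",
        (question_id,),
    ).fetchall()

    # Group the accepted answers per required question.
    accepted = {}
    for required_question, required_answer in rows:
        accepted.setdefault(required_question, set()).add(required_answer)

    # No requirements means the question is always asked.
    return all(
        given_answers.get(question) in answer_ids
        for question, answer_ids in accepted.items()
    )

print(should_ask(conn, 2, {1: 10}))
print(should_ask(conn, 2, {1: 11}))
```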
Personally, in this case, I would use the structure you described and use the database as a dumb storage mechanism. I'm a fan of putting these complex and dependent constraints into the application layer.
I think the only way to enforce these constraints without building new tables for every question with foreign keys to others is to use T-SQL or other vendor-specific mechanisms to build database triggers that enforce them.
At the application level you have many more possibilities and it is easier to port, so I would prefer that option.
I hope this will help you in finding a strategy for your app.

A beginner's guide to SQL database design [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
Do you know a good source to learn how to design SQL solutions?
Beyond the basic language syntax, I'm looking for something to help me understand:
What tables to build and how to link them
How to design for different scales (small client APP to a huge distributed website)
How to write effective / efficient / elegant SQL queries
I started with this book: Relational Database Design Clearly Explained (The Morgan Kaufmann Series in Data Management Systems) (Paperback) by Jan L. Harrington and found it very clear and helpful
and as you get up to speed this one was good too Database Systems: A Practical Approach to Design, Implementation and Management (International Computer Science Series) (Paperback)
I think SQL and database design are different (but complementary) skills.
I started out with this article
http://en.tekstenuitleg.net/articles/software/database-design-tutorial/intro.html
It's pretty concise compared to reading an entire book and it explains the basics of database design (normalization, types of relationships) very well.
Experience counts for a lot, but in terms of table design you can learn a lot from how ORMs like Hibernate and Grails operate to see why they do things. In addition:
Keep different types of data separate - don't store addresses in your order table, link to an address in a separate addresses table, for example.
I personally like having an integer or long surrogate key as the primary key on each table that holds data (not on tables that only link other tables together, e.g., m:n relationships).
I also like having a created and modified timestamp column.
Ensure that every column that you do "where column = val" in any query has an index. Maybe not the most perfect index in the world for the data type, but at least an index.
Set up your foreign keys. Also set up ON DELETE and ON UPDATE rules where relevant, to either cascade or set null, depending on your object structure (so you only need to delete once at the 'head' of your object tree, and all that object's sub-objects get removed automatically).
If you want to modularise your code, you might want to modularise your DB schema - e.g., this is the "customers" area, this is the "orders" area, and this is the "products" area, and use join/link tables between them, even if they're 1:n relations, and maybe duplicate the important information (i.e., duplicate the product name, code, price into your order_details table). Read up on normalisation.
Someone else will recommend exactly the opposite for some or all of the above :p - never one true way to do some things eh!
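As a rough illustration of the foreign key and index advice in the list above, here is a sqlite3 sketch. The orders and order_details tables and the index name are assumptions. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; other databases behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default

conn.executescript("""
CREATE TABLE orders (
    id       INTEGER PRIMARY KEY,
    customer TEXT NOT NULL
);
CREATE TABLE order_details (
    id           INTEGER PRIMARY KEY,
    order_id     INTEGER NOT NULL
                 REFERENCES orders(id) ON DELETE CASCADE ON UPDATE CASCADE,
    product_name TEXT NOT NULL
);
-- Index the column used in "WHERE order_id = ?" lookups and in the cascade.
CREATE INDEX idx_order_details_order_id ON order_details (order_id);

INSERT INTO orders VALUES (1, 'alice'), (2, 'bob');
INSERT INTO order_details VALUES
    (1, 1, 'widget'), (2, 1, 'gadget'), (3, 2, 'gizmo');
""")

# Deleting the 'head' of the object tree removes its detail rows as well.
conn.execute("DELETE FROM orders WHERE id = 1")

remaining = conn.execute(
    "SELECT id, order_id FROM order_details ORDER BY id"
).fetchall()
print(remaining)
```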
I really liked this article..
http://www.codeproject.com/Articles/359654/important-database-designing-rules-which-I-fo
Head First SQL is a great introduction.
These are questions which, in my opinion, require different knowledge from different domains.
You just can't know in advance "which" tables to build; you have to know the problem you have to solve and design the schema accordingly.
This is a mix of database design decisions and your database vendor's custom capabilities (i.e. you should check the documentation of your (r)dbms and eventually learn some "tips & tricks" for scaling). The configuration of your dbms is also crucial for scaling (replication, data partitioning and so on).
Again, almost every rdbms comes with a particular "dialect" of the SQL language, so if you want efficient queries you have to learn that particular dialect. (By the way, writing elegant queries which are also efficient is a big deal: elegance and efficiency are frequently conflicting goals.)
That said, maybe you want to read some books. Personally, I used this book in my university database course and found it a decent one, but I've not read other books in this field, so my advice is to look around for some good books on database design.
It's been a while since I read it (so, I'm not sure how much of it is still relevant), but my recollection is that Joe Celko's SQL for Smarties book provides a lot of info on writing elegant, effective, and efficient queries.