What is the best way to store a threaded message list/tree in SQL? - sql

I'm looking for the best way to store a set of "posts" as well as comments on those posts in SQL. Imagine a design similar to a "Wall" on Facebook where users can write posts on their wall and other users can comment on those posts. I need to be able to display all wall posts as well as the comments.
When I first started out, I came up with a table such as:
CREATE TABLE wallposts
(
id uuid NOT NULL,
posted timestamp NOT NULL,
userid uuid NOT NULL,
posterid uuid NOT NULL,
parentid uuid NULL,
comment text NOT NULL
)
id is unique, parentid will be null on original posts and point to an id if the row is a comment on an existing post. Easy enough and super fast to insert new data. However, doing a select which would return me:
POST 1
COMMENT 1
COMMENT 2
POST 2
COMMENT 1
COMMENT 2
in that order, regardless of how the rows were ordered in the database, proved extremely difficult. I obviously can't just order by date, as someone might comment on post 1 after post 2 has been posted. If I do a LEFT JOIN to get the parent post on all rows and then sort by the parent's date first, all the original posts group together, as they'd have a value of NULL.
Then I got this idea:
CREATE TABLE wallposts
(
id uuid NOT NULL,
threadposted timestamp,
posted timestamp,
...
comment text
)
On an original post, threadposted and posted would be the same. On a comment, threadposted would be the time the original post was posted and posted would be the time the comment on that thread was posted. Now I can just do:
select * from wallposts order by threadposted, posted;
This works great, but one thing irks me: if two people create a post at the same time, comments on the two posts would get munged together, as they'd have the same timestamp. I could use "ticks" instead of a datetime, but the accuracy is still only 1/1000 of a second. I could also set up a unique constraint on (threadposted, posted), which makes inserts a bit more expensive, and if I had multiple database servers in a farm, the chance of a collision is still there. I almost went ahead with this anyway, since the chances of a collision are extremely small, but I wanted to see if I could have my cake and eat it too, mostly out of my own educational curiosity.
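One way to keep this second design but remove the collision risk is to stop relying on timestamp uniqueness and tie-break on the thread's unique id instead. A sketch against the original single-table schema (untested; it only assumes ids compare consistently, which is all that's needed to keep two simultaneous threads apart):

SELECT c.*
FROM wallposts AS c
LEFT JOIN wallposts AS p ON c.parentid = p.id
ORDER BY COALESCE(p.posted, c.posted),   -- thread order, by the root post's date
         COALESCE(c.parentid, c.id),     -- tie-break simultaneous threads by root id
         c.posted;                       -- comments in posting order within a thread

The last key can still collide for two comments posted in the same tick on the same thread, so c.id could be appended as a final tie-breaker if a stable order matters there too.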
Third solution would be to store this data in the form of a graph. Each node would have a left and right value (nested sets). I could order by left, which would traverse the tree in the order I need. However, every time someone inserts a comment I'd have to rebalance the whole tree. This would create a ton of row locking, and all sorts of problems if the site was very busy. Plus, it's kind of extreme and also causes replication problems. So I tossed this idea quickly.
I also thought about just storing the original posts and then serializing the comments in a binary form, since who cares about individual comments. This would be very fast, however if a user wants to delete their comment or append a new comment to the end, I have to deserialize this data, modify the structure, then serialize it back and update the row. If a bunch of people are commenting on the same post at the same time, I might have random issues with that.
So here's what I eventually did. I query for all the posts ordered by date entered. In the middleware layer, I loop through the recordset and create a "stack" of original posts, where each node on the stack points to a linked list of comments. When I come across an original post, I push a new node onto the stack; when I come across a comment, I add a node to the linked list. I organize this in memory so I can traverse the recordset once, in O(n). After I create the in-memory representation of the wall, I traverse this data structure again and write out HTML. This works great: super fast inserts, super fast selects, and no weird row-locking issues. However, it's a bit heavier on my presentation layer and requires me to build an in-memory representation of the user's wall to get everything in the right order. Still, I believe this is the best approach I've found so far.
I thought I'd check with other SQL experts to see if there's a better way to do this using some weird JOINS or UNIONS or something which would still be performant with millions of users.

I think you're better off using a simpler model with a "ParentID" on Comment to allow for nesting comments. I don't think it's usually a good practice to use datetimes as keys, especially in this case, where you don't really need to, and an identity ID will be sufficient. Here's a basic example that might work:
Post
----
ID (PK)
Timestamp
UserID (FK)
Text
Comment
-------
ID (PK)
Timestamp
PostID (FK)
ParentCommentID (FK nullable) -- allows for nested comments
Text
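In T-SQL that sketch might look like the following (illustrative names and types; "Timestamp" is renamed Posted to avoid colliding with SQL Server's timestamp data type):

CREATE TABLE Post
(
    ID     int IDENTITY(1,1) PRIMARY KEY,
    Posted datetime NOT NULL,
    UserID uniqueidentifier NOT NULL,
    [Text] nvarchar(max) NOT NULL
);

CREATE TABLE Comment
(
    ID              int IDENTITY(1,1) PRIMARY KEY,
    Posted          datetime NOT NULL,
    PostID          int NOT NULL REFERENCES Post(ID),
    ParentCommentID int NULL REFERENCES Comment(ID), -- NULL = top-level comment
    [Text]          nvarchar(max) NOT NULL
);

Because identity IDs increase monotonically, ORDER BY PostID, ID gives a stable display order for the flat posts-plus-comments case without leaning on timestamps.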

Do you want people to be able to comment on other comments, i.e. does the tree have infinite depth?
If you just want to have posts and then comments on those posts, then you were on the right lines to start with, and I believe the following SQL would meet that requirement (untested, so there may be typos):
SELECT posts.id,
posts.posted AS posted_at,
posts.userid AS posted_by,
posts.posterid,
posts.comment AS post_text,
comments.posted AS commented_at,
comments.userid AS commented_by,
comments.comment AS comment_text
FROM wallposts AS posts
LEFT OUTER JOIN wallposts AS comments ON comments.parentid = posts.id
ORDER BY posts.posted, comments.posted
This technique, a self-join, simply joins the table to itself using table aliases to specify the joins.
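If comments can nest to arbitrary depth, a recursive CTE (SQL Server 2005+, also PostgreSQL 8.4+) generalizes the self-join by building a sortable path per row. A sketch against the same table (untested; the path segments must be fixed-width and lexically sortable, which is why ISO 8601 style 126 is used for the dates):

WITH thread AS
(
    -- anchor: the original posts
    SELECT id, comment,
           CAST(CONVERT(varchar(23), posted, 126) + '/' + CAST(id AS varchar(36)) AS varchar(900)) AS sort_path
    FROM wallposts
    WHERE parentid IS NULL

    UNION ALL

    -- recursive step: each comment hangs off its parent's path
    SELECT w.id, w.comment,
           CAST(t.sort_path + '/' + CONVERT(varchar(23), w.posted, 126) + '/' + CAST(w.id AS varchar(36)) AS varchar(900))
    FROM wallposts AS w
    INNER JOIN thread AS t ON w.parentid = t.id
)
SELECT id, comment
FROM thread
ORDER BY sort_path;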

You should look into "nested sets". They allow retrieving a hierarchy very easily with a single query.
Here's an article about them
If you are using SQL Server 2008, it has built-in support for this through the hierarchyid type.
Inserts and updates are more costly and complicated (especially if you don't have the built-in support), but querying is much faster and easier.
EDIT:
Damn, missed the part where you already knew about it. (was checking from a mobile phone).
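For reference, the read side is what makes nested sets attractive: given hypothetical lft/rgt columns, a single range predicate returns an entire subtree already in display order, with no recursion:

SELECT child.id, child.comment
FROM wallposts AS parent
JOIN wallposts AS child
  ON child.lft BETWEEN parent.lft AND parent.rgt
WHERE parent.id = @root_id
ORDER BY child.lft;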

If we stick to your table design … I think you would need some special value in the parentid column to separate original posts from comments (maybe just NULL, if you change definition of that column to nullable). Then, self-join will work. Something like this:
SELECT posts.comment as [Original Post],
comments.comment as Comment
FROM wallposts AS posts
LEFT OUTER JOIN wallposts AS comments
ON posts.id=comments.parentID
WHERE posts.parentID IS NULL
ORDER BY posts.posted, comments.posted
The result set shows Original Post before every comment, and has the right order.
(This was done using SQL Server, so I'm not sure if it works in your environment.)

Related

Hierarchy vs many tables

We have a requirement to store locations. There are different types of locations. Areas, Blocks, Buildings, Floors, Rooms and Beds. So, a Bed is in a room, which is on a floor etc.
I think I have two options. First is to have a table for each type. And a foreign key to keep them all linked.
OR...
CREATE TABLE [dbo].[Location]
(
[ID] Int IDENTITY(1,1) NOT NULL,
[ParentID] Int NULL,
[LocationTypeID] Int NOT NULL,
[Description] Varchar(100) COLLATE Latin1_General_CI_AS NOT NULL
)
A table to hold all locations, in a hierarchical style.
I like the idea of this, as if we add new types, it's data driven. No new table. But, the querying can be expensive I think.
If I want to show bed details (Bed 1 in Room 5 on the 4th floor of the science building...), it's a recursive function, which is more tricky than a simple INNER JOIN of all the tables to get details about a location.
One thing though.
I need to record movements. And a movement might be from a room, to an area. So, with separate tables, it will be hard to record movements in a single 'movement' table, as which table do I FK to? With a hierarchy, it's very easy.
Also, reporting on where 1000 people are, would call the recursive query a lot to produce results. Slow? Or is there a clean way to get around this?
All methods have their own pros and cons. Another approach I have seen (especially with dates) is to use one table and encode the hierarchy in a single field (no parent id).
For example ID (integer column) = 1289674566 would mean bed 66, floor 45, etc...
It will require a little work when you need to "extract" a specific hierarchy level (for example to count the number of distinct buildings) but arithmetic operations are quite fast and you can build views on top of the base table if you want to make life easier for end users.
Just another option...
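To make the extraction concrete: assuming two decimal digits per level, each level is recovered with integer division and modulo (values taken from the example ID above; level names are illustrative):

SELECT [ID] % 100           AS Bed,       -- 1289674566 -> 66
       ([ID] / 100) % 100   AS Floor,     -- -> 45, per the example above
       ([ID] / 10000) % 100 AS NextLevel  -- -> 67, and so on up the hierarchy
FROM [Location];

-- e.g. counting distinct values at one level:
SELECT COUNT(DISTINCT ([ID] / 10000) % 100) FROM [Location];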
My canned suggestion is to store the data how it is in the real world, then if you're lucky you can query it without too much of a hit. If there is too much of a hit, extract the data into a format that you can easily search.
In your case, I would go with the hierarchical style you are thinking. That way you can have a building with a room with a dresser with a drawer with a box in a box. You can then move the dresser to another room and all the stuff goes with it.
You'll find that recursive CTEs are fast as long as you're not trying to 'trick' SQL Server into doing something.
I just answered a hierarchical question here that has a good example for you to play with. In particular pay attention to the SORT_PATH. In the example I build a SORT_PATH that looks something like this:
TEST01.TEST03.LABSTL
SSRS: Recursive Parent Child
You can store this value in your table on EDIT/UPDATE and it can do a lot for you (for performance) as long as you don't mind the hit when you're updating the record.
If you do mind the hit on an update, you can use a backend process to keep your SORT_PATH updated. In the past I've used a "DIRTY BIT" field that gets flipped when something is modified; a backend process then comes through and updates everything related to that record, and since it's a backend process, users don't notice the impact. This is a good job for SERVICE BROKER to manage: on edit/update/delete, set DIRTY_BIT=True and send SERVICE BROKER a message that kicks off a process updating anything with DIRTY_BIT=True.
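A sketch of what that backend pass could look like, assuming hypothetical SORT_PATH and DIRTY_BIT columns on Location, with IDs zero-padded so lexical order matches numeric order (untested; a production version would restrict the recompute to dirty subtrees rather than the whole table):

WITH paths AS
(
    SELECT ID,
           CAST(RIGHT('0000000000' + CAST(ID AS varchar(10)), 10) AS varchar(900)) AS NewPath
    FROM [Location]
    WHERE ParentID IS NULL

    UNION ALL

    SELECT l.ID,
           CAST(p.NewPath + '.' + RIGHT('0000000000' + CAST(l.ID AS varchar(10)), 10) AS varchar(900))
    FROM [Location] AS l
    INNER JOIN paths AS p ON l.ParentID = p.ID
)
UPDATE loc
SET loc.SORT_PATH = paths.NewPath,
    loc.DIRTY_BIT = 0
FROM [Location] AS loc
INNER JOIN paths ON loc.ID = paths.ID;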
Have a look at using the hierarchyid data type; it's a CLR data type native to SQL Server and has built-in functions for querying parent/child type relationships:
Tutorial: Using the hierarchyid Data Type

How to build a Sql Server table to store Quiz answers paired with the questionId?

I have an .aspx form that has about 50 multiple choice questions in total on the survey.
Should I build a delimited string of the question id and the answer given and just store a string?
The problem with that is that the string might be long so a datatype text would be required?
This would allow all of the answers to be in 1 record.
Alternatively I was considering something like this, where each answer is its own record and each submitted survey would be joined by a uniqueidentifier.
What is the correct approach for this, or even something I have not thought of?
CREATE TABLE [dbo].[surveyAnswers](
[id] [int] IDENTITY(1,1) NOT NULL,
[questionId] [int] NOT NULL,
[quizId] [uniqueidentifier] NOT NULL,
[answerValue] [varchar](50) NULL,
[quizDate] [datetime] NOT NULL) ON [PRIMARY]
Do not store denormalized data - this will make life a pain later on [1]. Okay, now that that's out of the way we can skip past the first approach .. :)
The approach at the bottom looks "like a start", but there are several issues. Firstly, quizDate has no business being there (it is related to a quiz, not an answer) and, secondly, the schema doesn't capture who actually took the quiz.
My model (shown in simplified form; I also capture criteria groupings, multiple deployments, dimensions, aspects and multi-values, which are overkill for a simple case) looks similar to the following at heart:
-- Survey/Quiz "has many Questions"
Survey (SurveyStartedAt, SurveyExpiresAt)
-- Question "belongs to a Survey"
Question (FK Survey, QuestionPrompt, QuestionRules..)
-- Each Participant can "respond to a Survey"
Response (FK Participant, FK Survey, ResponseTime)
-- And each Answer, of which there are many per Response,
-- "contains the answer for a single Question"
Answer (FK Response, FK Question, Value)
Using the above approach allows me to run queries like:
Who took (or didn't take) the quiz? When? How many times?
What was the average rating for "Likert" Questions?
Which optional questions were not answered?
(And many more including advanced roll-ups; SQL Server is quite powerful.)
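The first two of those, sketched against the relations above (hypothetical key and column names):

-- Who didn't take survey @SurveyID?
SELECT p.ParticipantID
FROM Participant AS p
LEFT JOIN Response AS r
       ON r.ParticipantID = p.ParticipantID
      AND r.SurveyID = @SurveyID
WHERE r.ResponseID IS NULL;

-- Average rating per Likert question, assuming a QuestionType column
-- and numeric answer values:
SELECT a.QuestionID, AVG(CAST(a.Value AS decimal(9,2))) AS AvgRating
FROM Answer AS a
INNER JOIN Question AS q ON q.QuestionID = a.QuestionID
WHERE q.QuestionType = 'Likert'
GROUP BY a.QuestionID;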
Note that I eschew a bit of safety and normalization (as Answer->Question, but Response->Survey->Question) so the schema does not prevent an invalid Answer(Response, Question) pair. While this could be dealt with by feeding the Survey through into the Answer relation (and, perhaps better, adding a SurveyQuestions relation), I deal with that by imposing an immutable design and a gatekeeper: once a Quiz starts the Survey (and all related Questions) can never be altered.
[1] While it might be tempting to use one record-per-quiz result based on "performance" to "avoid creating excessive rows" - don't; there is no performance issue here! The Answer records are very small (can be compacted nicely on pages) and, when properly indexed, are very fast to retrieve and run queries against. At the point where such decomposition leads to "too many" records (in excess of many millions), other approaches can be considered - but do not start with a denormalized relation schema!

Efficient Implementation of a Notification System -- Should I use or avoid JOINs?

The Tables
Let us assume that we have a table of articles:
CREATE TABLE articles
(
id integer PRIMARY KEY,
last_update timestamp NOT NULL,
...
);
Users can bookmark articles:
CREATE TABLE bookmarks
(
user integer NOT NULL REFERENCES users(id),
article integer NOT NULL REFERENCES articles(id),
PRIMARY KEY(user, article),
last_seen timestamp NOT NULL
);
The Feature to be Implemented
What I want to do now is to inform users about articles which have been updated after the user has last seen them. The access to the whole system is via a web interface. Whenever a page is requested, the system should check whether the user should be notified about updated articles (similar to the notification bar on the top of a page here on SO).
The Question
What is the best and most efficient implementation of such a feature, given that both tables above contain tens of millions of rows?
My Solution #1
One could do a simple join like this:
SELECT ... FROM articles, bookmarks WHERE bookmarks.user = 1234
AND bookmarks.article = articles.id AND last_seen < last_update;
However, I'm worried that doing this JOIN might be expensive if the user has many bookmarked articles (which might happen more often than you think), especially if the database (in my case PostgreSQL) has to traverse the index on the primary key of articles for every bookmarked article. Also the last_seen < last_update predicate can only be checked after accessing the rows on the disk.
My Solution #2
Another method is more difficult, but might be better in my case. It involves expanding the bookmarks table by a notify column:
CREATE TABLE bookmarks
(
user integer NOT NULL REFERENCES users(id),
article integer NOT NULL REFERENCES articles(id),
PRIMARY KEY(user, article),
last_seen timestamp NOT NULL,
notify boolean NOT NULL DEFAULT false
);
CREATE INDEX bookmark_article_idx ON bookmarks (article);
Whenever an article is updated, the update operation should trigger setting notify to true for every user who has bookmarked this article. The big disadvantage that comes to mind is that if an article has been bookmarked a lot, setting notify to true for lots of rows can be expensive. The advantage could be that checking for notifications is as simple as:
SELECT article FROM bookmarks WHERE user = 1234 AND notify = true;
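The fan-out step in this design could be a trigger on articles; a PostgreSQL sketch (untested; resetting notify back to false when the user views the article is left to the application):

CREATE OR REPLACE FUNCTION flag_bookmarks() RETURNS trigger AS $$
BEGIN
    UPDATE bookmarks SET notify = true WHERE article = NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_flag_bookmarks
AFTER UPDATE OF last_update ON articles
FOR EACH ROW
EXECUTE PROCEDURE flag_bookmarks();

The bookmark_article_idx index above is exactly what keeps the UPDATE inside the trigger cheap for lightly bookmarked articles; heavily bookmarked ones still pay the wide fan-out cost described.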
Final Thoughts
I think that the second method can be a lot more efficient if the number of page views (and with it the number of times the system checks for notifications) outweighs the number of updates of articles. However, this might not always be the case. There might be users with lots of bookmarked articles which log in only once a month for a couple of minutes, and others who check for updates almost every minute.
There's also a third method that involves a notification table in which the system INSERTs notifications for every user once an article is updated. However, I consider that an inefficient variant of Method #2 since it involves saving notifications.
What method is the most efficient when both tables contain millions of rows? Do you have another method that might be better?
I would certainly go for solution one, making sure that articles has an index on (id, last_update).
Normalization theory takes you directly to solution #1. Rather than asking which design is faster, you might want to ask: how do I make my server execute this query efficiently, given my bog-standard BCNF tables? :-)
If your server cannot be made to execute your query fast enough (for whatever value of enough in your case) you need a faster server. Why? Because performance will only degrade as users and rows are added. Normalization was invented to minimize updates and update anomalies. Use it to your advantage, or pay the price in hours of your time and hard-to-detect errors in your system.
I see a third solution, to make things more interesting. ;-) It is a mixture of both solutions. I would assume that there is a time of day or night when there is little usage on the system, and do a daily/nightly run to mark all the bookmarks that are new.
That alone would delay the "new article updates for you!" information by a day, which is not what you want. But I would store an additional column "updated today" (enum 'Yes'/'No', or a tinyint) which is set to "Yes" on article update and reset to "No" on that nightly update run.
Then show "has changes" for all bookmarks marked "is changed" (by the nightly cron), and additionally add the information from the solution #1 SELECT, restricted to the articles which have changed today.
Probably most articles do not get updated daily, so you should win with that.
Of course I would endorse the "measure it" answer, but you need a lot of assumptions to make a good benchmark.

Database structure for voting system with up- and down votes

I am going to create a voting system for a web application and wonder what the best way would be to store the votes in the (SQL) database.
The voting system is similar to the one on Stack Overflow. I am pondering now whether I should store the up and down votes in different tables. That way it is easier to count all up votes resp. down votes. On the other hand, I have to query two tables to find all votes for a user or voted item.
An alternative would be one table with a boolean field that specifies if this vote is an up or down vote. But I guess counting up or down votes is quite slow (when you have a lot of votes), and an index on a boolean field (as far as I know) does not make a lot of sense.
How would you create the database structure? One or two tables?
Regarding the comments, we found the solution that best fits Zardoz's needs: he does not want to always count votes and needs as much detail as possible. So the solution is a mix of both.
Adding an integer field in the considered table to store vote counts (make sure there won't be overflows).
Create additional tables to log the votes (user, post, date, up/down, etc.)
I would recommend to use triggers to automatically update the 'vote count field' when inserting/deleting/updating a vote in the log table.
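A sketch of such triggers in MySQL, assuming a posts.vote_count column and a votes log table whose value column holds +1 or -1 (hypothetical names):

CREATE TRIGGER votes_after_insert
AFTER INSERT ON votes
FOR EACH ROW
    UPDATE posts SET vote_count = vote_count + NEW.value
    WHERE posts.id = NEW.post_id;

CREATE TRIGGER votes_after_delete
AFTER DELETE ON votes
FOR EACH ROW
    UPDATE posts SET vote_count = vote_count - OLD.value
    WHERE posts.id = OLD.post_id;

A vote flip would need a third AFTER UPDATE trigger applying the difference NEW.value - OLD.value.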
If your votes are just up/down then you could make a votes table linking to the posts and having a value of 1 or -1 (up / down). This way you can sum in a single go.
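With that layout, the total and the up/down split both come out of one aggregation (assuming a votes(post_id, value) table with value in {1, -1}):

SELECT post_id,
       SUM(value)                                   AS score,
       SUM(CASE WHEN value = 1  THEN 1 ELSE 0 END)  AS upvotes,
       SUM(CASE WHEN value = -1 THEN 1 ELSE 0 END)  AS downvotes
FROM votes
GROUP BY post_id;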
https://meta.stackexchange.com/questions/1863/so-database-schema
Worth a look or
http://sqlserverpedia.com/wiki/Understanding_the_StackOverflow_Database_Schema
You will need a link table between users and the entities which are being voted on, I would have thought. This will allow you to see which users have already voted and prevent them from submitting further votes. The table can record in a boolean whether it is an up or down vote.
I would advise storing in the voted entity a current vote tally field to ease querying. The saving in size would be negligible if you omitted this.

Recommended Table Set up for one to many/many to one situation

I need to create a script where someone will post an opening for a position, and anyone who is eligible will see the opening but anyone who is not (or opts out) will not see the opening. So two people could go to the same page and see different content, some potentially the same, some totally unique. I'm not sure the best way to arrange that data in a MySQL DB/table.
For instance, I could have it arranged by the posting, but that would look sort of like:
PostID VisibleTo
PostingA user1,user2
And that seems wrong (the CSV style in the column). Or I could go with by person:
User VisiblePosts
user1 posting1, posting2
But it's the same problem. Is there a way to make the users unique, the postings unique, and have them join only where they match?
The decision is initially made by doing a series of queries against another set of tables, but once that is run, it seems inefficient to have that same chunk of code run again and again when it won't change after the user posts the position.
...On second thought, it MIGHT change, but if we assume it doesn't (as it is unlikely, and of little consequence if a user sees something that they are no longer eligible for), is there a standard solution for this scenario?
Three tables...
User:
[UserId]
[OtherField]
Post:
[PostId]
[OtherFields]
UserPost:
[UserId]
[PostId]
User.UserId Joins to UserPost.UserId,
Post.PostId Joins to UserPost.PostId
Then look up the table UserPost, joining to Post when you are selecting which posts to show
This is a many-to-many relationship or n:m relationship.
You would create an additional table, say PostVisibility, with a column PostID and UserID. If a combination of PostID and UserID is present in the table, that post is visible to that user.
Edit: Sorry, I think you are speaking in Posting-User terms, which is many-to-many. I was thinking of this in terms of posting-"viewing rights" terms, which is one-to-many.
Unless I am missing something, this is a one-to-many situation, which requires two tables. E.g., each posting has n users who can view it. Postings are unique to an individual user, so you don't need to do the reverse.
PostingTable with PostingID (and other data)
PostingVisibilityTable with PostingID and UserID
UserTable with UserID and user data
Create the postings independently of their visibility rights, and then separately add/remove PostingID/UserID pairs against the Visibility table.
To select all postings visible to the current user:
SELECT * FROM PostingTable A INNER JOIN PostingVisibilityTable B ON A.PostingID = B.PostingID WHERE B.UserID = "currentUserID"