Generate webpages directly from database or cache? - optimization

[I'm not asking about the architecture of SO specifically, but it's helpful context for the question.]
On SO, when a user clicks on their name and then on "responses", they see other users' responses to comment threads, questions, and answers in which they have participated. I've had the sneaking suspicion that I've missed certain responses out there, which made me wonder: if you had to build that feature, would you pull everything dynamically from the database every time a user requested it? Or would you update it whenever there is new related activity in the application? Or would you build it in a nightly daemon process?
I imagine that the real answer is that it's dynamically constructed every time, but that the tables are denormalized in such a way as to make the query less time-consuming. How would you build it?
I'm asking about any platform, of course, not only .NET.

I would pull it dynamically from the database every time. I think this gives you the best result from a user-experience standpoint, and then I would apply the principle that premature optimization is evil. Later, if there were performance issues, I would look into caching.
I think doing it as a daemon/push process would actually result in more overall work being done. That is, the updates would happen more frequently than users actually request the info.

Obviously, when an answer or comment is posted, you'll want to identify the user that should be informed in their responses tab. Then just add a row to a responses table containing the response text, timestamp, and the user to which it belongs. That way you can dynamically generate the tab with a simple
select * from responses where user=<userid> order by time desc limit 30
or something like that.
p.s. Extra credit to anyone that can write a query that will remove old responses - assume that each person should have the last 30 responses in their responses tab.
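One way to tackle that extra credit, as a sketch only: this assumes a window-function-capable engine (PostgreSQL here) and spells the columns as user_id and created_at rather than the user/time shorthand above.

-- Illustrative only: delete everything except each user's 30 most recent responses.
DELETE FROM responses
WHERE (user_id, created_at) IN (
    SELECT user_id, created_at
    FROM (
        SELECT user_id,
               created_at,
               ROW_NUMBER() OVER (PARTITION BY user_id
                                  ORDER BY created_at DESC) AS rn
        FROM responses
    ) AS ranked
    WHERE rn > 30   -- rows beyond the 30 newest per user
);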

I expect that userid would be a natural option for the clustered index. If you have an "Active" boolean field then you don't need to worry much about locks; the table could be write-only except to update the (unindexed) Active column. I bet it already works that way, since it appears that everything is recoverable.
Don't need no stinking extra-credit response remover.
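For illustration, the layout that answer describes might look roughly like this in SQL Server syntax (column names are assumptions, not SO's actual schema):

-- Write-mostly responses table; only the unindexed Active flag ever gets updated.
CREATE TABLE Responses (
    UserId    int           NOT NULL,
    CreatedAt datetime2     NOT NULL,
    Body      nvarchar(max) NOT NULL,
    Active    bit           NOT NULL DEFAULT 1   -- deliberately left out of any index
);

-- Cluster on the user so one user's responses sit together on disk.
CREATE CLUSTERED INDEX IX_Responses_User ON Responses (UserId, CreatedAt DESC);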

I would assume this is denormalized in the database. The Comment table probably has both an answer_id and an answer_uid, so the SQL to find comments on your answers runs against just the comment table. The same setup would work on the Answer table: each answer has a question_id and a question_uid.
Having said that, these are probably all the same table, with response_to_id and response_to_uid columns; that makes a lot of code simpler and makes the "recent" tab a single select as well. In fact, the only difference between the two selects is that one uses the uid and the other uses the response_to_uid.
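To make the "single table, two selects" idea concrete, here is a sketch (the posts table name and created_at column are assumptions; uid and response_to_uid come from the answer above):

-- "Responses" tab: things other people posted in reply to me
SELECT * FROM posts WHERE response_to_uid = :me ORDER BY created_at DESC;

-- "Recent" tab: things I posted myself; only the filter column changes
SELECT * FROM posts WHERE uid = :me ORDER BY created_at DESC;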

I'd say that your UI and your database should both be driven by your Application Domain, so they will reflect each other through that common provenance.
Some quick notes to illustrate, using simplified Object Role Modeling as discussed by Fowler et al.
Entities
Users
Questions
Answers
Comments
Entity Roles
(Note: In Object Role Modeling, most Roles are reflexive. Some, e.g. booleans here, are monopolar)
Question has User
Question has QuestionVersions
Question has Answers
Question has Comments
Answer has AnswerVersions
Answer has Comments
Question has User
QuestionVersion has Text
QuestionVersion has Timestamp
QuestionVersion has IsDeleted (could be inferred from a non-NULL DeletedTimestamp, e.g.)
QuestionVersion has DeletedByUser
QuestionVersion has DeletedTimestamp
Answer has User
AnswerVersion has Text
AnswerVersion has Timestamp
AnswerVersion has IsDeleted
AnswerVersion has DeletedByUser
AnswerVersion has DeletedTimestamp
Comment has Text
Comment has User
Comment has Timestamp
Comment has IsDeleted (boolean)
(note - no versions on comments)
I think those are the basics. These assertions drive ERDs in ORM. Hopefully it's self-evident how they drive the User Stories as well.
I don't think an implementation of a normalized design like this would require denormalization, especially since it seems clear (from observed behavior) that the query results behind the UI displays are cached and refreshed about once per minute.
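As a rough illustration (generic SQL, illustrative names only, not the asker's schema), a few of the roles above might map onto normalized tables like this:

CREATE TABLE Users (
    UserId int PRIMARY KEY
);

CREATE TABLE Questions (
    QuestionId int PRIMARY KEY,
    UserId     int NOT NULL REFERENCES Users (UserId)
);

CREATE TABLE QuestionVersions (
    QuestionVersionId int       PRIMARY KEY,
    QuestionId        int       NOT NULL REFERENCES Questions (QuestionId),
    Text              text      NOT NULL,
    CreatedAt         timestamp NOT NULL,
    DeletedByUserId   int       NULL REFERENCES Users (UserId),
    DeletedAt         timestamp NULL   -- IsDeleted can be inferred from this being non-NULL
);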

Related

Setting scope for :sorted_by to select max date from join table

I am sure this is a simple syntax issue, and I am taking a chance by asking the question, but I just can't figure it out on my own. I have a functioning Filterrific :sorted_by selector that works great and sorts my Ticket records by both text and date field values.
My Tickets table joins to a Comments table in a one-to-many relationship, and I am trying to sort a set of Ticket records by the newest update date from their associated Comment records.
Initially my solution was to add this to my scope:
when /*comment_date/
order("tickets.comments.updated_at #{direction}").includes(:comments).references(:comments)
The behavior is perplexing: my data is not being sorted across pagination, though perhaps coincidentally the records on a selected page are in order. I would love to post the backend SQL, but work constraints do not allow it, so I hope I am not scolded or chastised for not providing enough evidence of my research.
Trying to find examples of Filterrific solutions is tough; after a while I keep seeing the same posts on different websites. I have read every open and closed issue on GitHub, and the 79 or so tagged questions here on SO. It's so frustrating because I am sure there is a simple, logical solution, but I can't see it!
As always, thank you for your time. I consider your attention a privilege, not a right, and I value any guidance.
Use joins instead of includes; that way, the table is in your query.
...joins(:comments).references(:comments)
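To see why the join matters, here is roughly the shape of SQL the query needs to end up as (an assumed sketch, not ActiveRecord's exact output): with joins, the comments table is in the FROM clause, so the ORDER BY has something to reference.

SELECT tickets.*
FROM tickets
INNER JOIN comments ON comments.ticket_id = tickets.id
-- Note: with several comments per ticket this can repeat tickets; sorting by the
-- newest comment date typically means grouping and ordering by MAX(comments.updated_at).
ORDER BY comments.updated_at DESC;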
Same as what #thesecretmaster mentioned. Just making a post so you can close this.

Primary Key Type Guid or Int? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I am wondering what the recommended type for a PK in SQL Server is. I remember reading this article a long time ago, but now I am wondering if it is still a wise decision to use a GUID.
One reason that got me thinking about it is that these days many sites use the id in the URL; for instance, Course/1 would get the information about that record.
You can't really do that with a GUID, which would mean you would need some new column that is unique and use that instead, which is more work since you have to make sure each record has a unique number.
There is never a "one solution fits all". You have to carefully design your architecture and select the best options for your scenario. Both INT and GUID types are valid options like they've always been.
You can absolutely use a GUID in a URL. In fact, in most scenarios, it is better to use a GUID (or another random ID) in the URL than a sequential numeric ID, for security reasons. If you use sequential IDs, your site visitors will be able to easily guess other users' IDs and potentially access their content. For example, if my profile URL is /Profiles/111, I can try /Profiles/112 and see if I can access it. If my reservation URL is Reservation/444, I can try Reservation/441 and see what happens. I can easily guess other IDs in the system. Of course, you must have strong permissions, so I should not be able to see those other pages that don't belong to my account, but if there are any issues or holes in your permissions and security, a breach can happen. With GUIDs and other random IDs, there is no way to guess other IDs in the system, so such a breach is much more difficult.
Another issue with sequential IDs is that your users can guess how many accounts or records you have and their order in your database. If my ID is 50269, I know that you must have about that many records. If my ID is 4, then I know that you had very few accounts when I registered. For that reason, many developers start the first ID at some random high number like 1529 instead of 1. It doesn't solve the issue entirely, but it avoids the problems with small IDs. How important all that guessing is depends on the system, so you have to evaluate your scenario carefully.
That's on top of the benefits mentioned in the article you referenced in your question. But still, an integer is better in some areas, so choose the best option for your scenario.
EDIT: To answer the point you raised in your comment about user-friendly URLs: in those scenarios, sequential numbers are the wrong answer. A better solution is a unique string in the URL which is linked to your numeric ID. For example, the Cars movie has this URL on IMDB:
https://www.imdb.com/title/tt0317219/
Now, compare that to the URL of the same movie on Wikipedia, Rotten Tomatoes, Plugged In, or Facebook:
https://en.wikipedia.org/wiki/Cars_(film)
https://www.rottentomatoes.com/m/cars/
https://www.pluggedin.ca/movie-reviews/cars/
https://www.facebook.com/PixarCars
We must agree that those URLs are much friendlier than the one from IMDB.
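A sketch of how that "friendly slug tied to a numeric ID" idea could look (SQL Server syntax; table and column names are made up for illustration):

CREATE TABLE Movies (
    MovieId int IDENTITY(1,1) PRIMARY KEY,
    Slug    nvarchar(100) NOT NULL UNIQUE,   -- e.g. 'cars-film', the string that appears in the URL
    Title   nvarchar(200) NOT NULL
);

-- Routing /movies/cars-film looks the record up by its slug, not its numeric key:
SELECT MovieId, Title FROM Movies WHERE Slug = 'cars-film';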
I've worked on small, medium, and large scale implementations (100k+ users) with SQL Server and Oracle. The majority of the time an INT PK is used when needed. GUIDs were more popular 10-15 years ago, but even at their height were not as popular as INT. Unless you see a need for a GUID, I would recommend INT.
My experience has been that the only time a GUID is needed is if your data is on the move or merged with other databases. For example, say you have three sites running the same application and you merge those three systems for reporting purposes.
If your data is stationary or you are running a single instance, an INT should be sufficient.
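For reference, minimal SQL Server sketches of the two options being compared (table names are hypothetical):

-- Option A: surrogate INT identity key
CREATE TABLE CoursesInt (
    CourseId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Title    nvarchar(200) NOT NULL
);

-- Option B: GUID key; NEWSEQUENTIALID() keeps inserts roughly ordered, avoiding the
-- index fragmentation a random NEWID() clustered key tends to cause
CREATE TABLE CoursesGuid (
    CourseId uniqueidentifier NOT NULL DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
    Title    nvarchar(200) NOT NULL
);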
According to the article you mention:
GUIDs are unique across every table, every database, every server
Well... this is a great promise, but it can fail to deliver. GUIDs are supposed to be unique snowflakes. However, reality is much more complicated than that, and there are numerous reasons why they end up not being unique.
One of the main reasons is not the UUID/GUID specification itself, but poor implementations of it. For example, some JavaScript implementations rank among the worst, using pseudo-random numbers that are quite predictable. Other implementations are much more decent.
So, bottom line, study the specific implementation of UUID/GUID you are using and will be using. Don't just read and trust the specification. Otherwise you may be in for a surprise when you get called at 3 am on a Saturday night by angry customers.

Is this how a modern news site would handle its SQL/business logic?

Basically, the below image represents the components on the homepage of a site I'm working on, which will have news components all over the place. The SQL snippets envision how I think they should work; I would really appreciate some business-logic advice from people who've worked with news sites before, though. Here's how I envision it:
Question #1: Does the sql logic make sense? Are there any caveats/flaws that you can see?
My schema would be something like:
articles:
article_id int unsigned not null primary key,
article_permalink varchar(100),
article_name varchar(100),
article_snippet text,
article_full text,
article_type tinyint(3) default 1
I would store all articles (main featured, sub-featured, the rest) in one table and categorize them by a type column, which would correspond to a number in my news_types table (for the example I used literal text as it's easier to understand).
Question #1.1: Is it alright to rely on one table for different types of articles?
A news article can have 3 image types:
1x original image size, which would show up only on the article's permalink page
1x main featured image, which would show up on the homepage section #1
1x sub featured image, which would show up in the homepage section #2
For now I want each article to correspond to one image and not multiple. A user can post images for the article in the article_full TEXT column though.
Question #1.2: I'm not sure how I should incorporate the article images into my schema; is it common for a schema to rely on 2 tables like this?
article_image_links:
article_id article_image_id
1 1
article_images:
article_image_id article_image_url
1 media.site.com/articles/blah.jpg
Requirements for the data:
From the way I have my SQL logic, there has to be some data in order for stuff to show up:
there has to be at least one main type article
there have to be at least four featured-type articles, which appear below the main one
Question #1.3: Should I bother creating special cases for if data is missing? For example, if there's no main featured article, should I select the latest featured, or should I make it a requirement that someone always specify a main article?
Question #1.4: In the admin, when a user goes to post, by default I'll have a dropdown which specifies the article type, normal will be pre-selected and it will have the option for main and featured. So if a user at one point decides to change the article type, he/she can do that.
Question #1.5: My featured and main articles are selected purely by latest date. If the user wants, for example, to specify an older article as the main article for whatever reason, should I create custom logic, or just tell them to update the article date so it's later than the latest?
In regard to the question in the title, there is definitely more than one way to skin a cat. What's right for one site may not be right for another. Some things that could factor into your decision are how large the site needs to scale (e.g. are there going to be dozens of articles or millions?) and who will be entering the data (e.g. how much idiot-proofing do you need to build in). I'll try to answer the questions as best I can based on the information you gave.
Question #1: Yes, looks fine to me. Be sure to set your indexes (I'd put indexes on [type, date] and [category, type, date]; see the sketch after this list).
Question #1.1: Yes, I would say that is alright, in fact, I would say it is preferred. If I understand the question correctly (that this is as opposed to a table for each "type") then this sets you up better for adding new types in the future if you want to.
Question #1.2: If you only want one image for each story and one story for each image I'm not seeing the advantage of splitting that up into an extra table. It seems like it's just more overhead. But I could be missing something here.
Question #1.3: That's a design decision up to you, there's no "right" answer here. It all depends on your intended uses of the system.
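A minimal sketch of the indexes suggested in the first point (MySQL syntax; it assumes the articles table also gains article_date and article_category columns, which those indexes imply but the posted schema doesn't yet have):

-- Serves the "latest N articles of a given type" queries
CREATE INDEX idx_articles_type_date
    ON articles (article_type, article_date);

-- Serves per-category listings filtered by type and ordered by date
CREATE INDEX idx_articles_cat_type_date
    ON articles (article_category, article_type, article_date);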

In a site like StackOverflow should the Question and its Votes be separate tables?

I'm making a site like StackOverflow in Rails but I'm not sure if it's necessary for the Votes on a question to be stored in a separate table in the database.
Is there any good reason to separate the data?
Or could I store the Votes as a single sum in a field of the Questions table?
How would you know if a user voted on a question without keeping a votes table? Or, like this website, which holds you to X votes a day: how would you know how many votes a user made in a day? How would you keep track of how many up and down votes a user has cast? I think good design practices pretty much scream for you to normalize the data and keep a votes table, perhaps with a denormalized current +/- field in the question row for easy fetching.
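A sketch of what that could look like (generic SQL; table and column names are assumptions): one row per vote, plus a denormalized score on the question row for cheap reads.

CREATE TABLE votes (
    question_id int       NOT NULL,
    user_id     int       NOT NULL,
    vote_value  smallint  NOT NULL,              -- +1 or -1
    created_at  timestamp NOT NULL,
    PRIMARY KEY (question_id, user_id)           -- one vote per user per question
);

-- Refresh the cached score for a question after one of its votes changes
UPDATE questions
SET score = (SELECT COALESCE(SUM(vote_value), 0)
             FROM votes
             WHERE votes.question_id = questions.question_id)
WHERE questions.question_id = ?;                 -- the question that just received a vote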
Yes! Think about it from an object perspective. In model-driven development (objects first) you would have a container (table) of questions, and a container of votes. Of course you could simply roll them up into an aggregate form. However, by doing that you lose a lot of metric detail, such as who cast the vote, when, etc. It really depends on whether you need the detail or not. Space is cheap, so not keeping the detail is usually not a good idea. It is hard to foresee what is needed in the future!
Think about your data in multiple dimensions. There's more going on than the mere number of votes. There's:
Who cast the vote
When they cast the vote
The effect (think like a financial transaction) of the vote on any number of parties
Can you afford to discard this data? Will you ever need it? In Stackoverflow, it must be known whether I voted on something to determine if I can vote; what the vote was, so I can change it; the effect of the vote so it can be rolled back if I change it; etc.
Votes would also need to be able to be applied to both questions and answers, although both questions and answers could be stored in one table/class called Post or something similar, since they are the same data with a different title.
Like the last two answers say: keep a separate votes table.
But it would be advisable to create a view that will aggregate votes per user, per question etc. so that you don't need to do a manual query when you need that info.
jrh
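The view suggested in the answer above might look something like this (a sketch; it assumes a votes table with question_id and vote_value columns, as in the earlier sketch):

CREATE VIEW question_vote_totals AS
SELECT question_id,
       SUM(vote_value) AS score,        -- vote_value assumed to be +1 / -1
       COUNT(*)        AS vote_count
FROM votes
GROUP BY question_id;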
Yes, I would go so far as to say that it is vital to help reduce the likelihood that one person could bias the result by repeatedly voting something up or down.
It actually has very little to do with OOP, and more to do with preventing exploits.
For performance reasons you could use a static vote count in the questions table that gets updated when the vote data for a question changes. I would not, though, use only a vote count by itself unless you really don't care about results being biased by particular people.

What's the best way to store/calculate user scores?

I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does, so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made, and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot of querying to get the total score for each user, which I would like to display next to their username (in the form of a level). For example, a query to the submitted-items table to get the scores for each item from that user, a query to the comments table, etc. And all this needs to be done for every user mentioned on a page.... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even Stack Overflow does something similar), so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!
I think that this is definitely a great question. I've had to build systems with similar behavior to this, especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount          UserAchievement
-Guid (PK)           -Guid (PK)
-FirstName           -UserAccountGuid (FK)
-LastName            -Name
-EmailAddress        -Score
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
SELECT [UserAccount].[FirstName] AS FirstName,
       [UserAccount].[LastName] AS LastName,
       SUM([UserAchievement].[Score]) AS TotalPoints
FROM [UserAccount]
INNER JOIN [UserAchievement]
        ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
GROUP BY [UserAccount].[FirstName],
         [UserAccount].[LastName]
ORDER BY [UserAccount].[LastName] ASC
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
I will add one more point: if your UserAccount table gets huge, you may consider a slightly more intelligent query that incorporates the names of the accounts you need roll-ups for. This will make it possible not to return huge data sets to your web site when you're only showing, say, 3-10 users' information on the page. I'd have to think a bit more about how to do this elegantly, but I'd suggest staying away from "IN" statements since this will invoke a linear search of the table.
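The index that answer describes might look something like this (a sketch, not verified against a real schema):

-- Nonclustered index keyed on the foreign key, with Score included so the SUM
-- can be answered from the index alone (SQL Server syntax).
CREATE NONCLUSTERED INDEX IX_UserAchievement_UserAccountGuid
    ON UserAchievement (UserAccountGuid)
    INCLUDE (Score);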
For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync... so the updates to the score total are a synchronous part of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose whether you want some kind of delayed update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
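For completeness, a sketch of the indexed-view route in SQL Server (it assumes Score is declared NOT NULL and reuses the table names from the earlier answer):

CREATE VIEW dbo.UserScoreTotals
WITH SCHEMABINDING
AS
SELECT UserAccountGuid,
       SUM(Score)   AS TotalPoints,      -- SUM over a NOT NULL column, as indexed views require
       COUNT_BIG(*) AS AchievementCount  -- mandatory when an indexed view uses GROUP BY
FROM dbo.UserAchievement
GROUP BY UserAccountGuid;
GO

-- The unique clustered index materializes the view; SQL Server keeps it in sync on every write.
CREATE UNIQUE CLUSTERED INDEX IX_UserScoreTotals
    ON dbo.UserScoreTotals (UserAccountGuid);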
In the past we've always used some sort of nightly or periodic cron job to calculate the current score and save it in the database, sort of like a persistent view of the SUM over the activities table. Like most "best practices", they are simply guidelines, and it's often better and more practical to deviate from a specific hard-nosed practice in very specific areas.
Plus, it's not really all that much of a deviation if you use the cron job, since the result is better viewed as a cache stored in the database.
If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the sites code.
The user scores would be updated continuously, and could be quickly queried for display.
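The trigger variant could be sketched like this (SQL Server flavor; table and column names are assumptions, and you would want a similar trigger on each point-earning table):

CREATE TRIGGER trg_comments_add_score
ON comments
AFTER INSERT
AS
BEGIN
    -- Add 1 point per new comment to each poster's cached score
    UPDATE s
    SET s.score = s.score + i.cnt
    FROM user_scores AS s
    JOIN (SELECT user_id, COUNT(*) AS cnt
          FROM inserted
          GROUP BY user_id) AS i
      ON i.user_id = s.user_id;
END;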