I am working on a web app and would like to be able to display some computed statistics about different objects to the user. Some good examples would be: The "Top Questions" page on this site - SO - which lists "Votes", "Answers", and "Views" or the "Comment" and "Like" counts for a list of posts on the Facebook News Feed. Actually computed values like these are used all over the web and I am not sure the best way to implement them.
I will describe a generic view of the problem in greater detail. I have a single parent table in my database (you can visualize it as a blog post). This table has a one-to-many relationship with 5 other tables (visualize them as comments, likes, etc.). I need to display a list of ~20 parent table objects with the counts of each related child object (visualize it as a list of blog posts, each displaying the total number of comments and total number of likes). I can think of multiple ways to tackle the problem, but I am not sure which would be the FASTEST and most ACCURATE.
Here are a number of options I have come up with...
A) SQL Trigger - Create a trigger to increment and decrement a computed count column on the parent table as the child tables have inserts and deletes performed. I'm not sure about the performance tradeoffs of running the trigger every time a small child object is created or deleted. I am also unsure about potential concurrency issues (although in my current architecture each row in the db can only be added or deleted by the row creator).
B) SQL View - Just an easier way to query and will yield accurate results, but I am worried about the performance implications for this type of view.
C) SQL Indexed View - An indexed view would be accurate and potentially faster, but as each child table has rows that can be added or removed, the view would constantly have to be recalculated.
D) Cached Changes - Some kind of interim in-process solution that would cache changes to the child tables, compute net changes to the counts, and flush them to the db based on some parameter. This could potentially be coupled with a process that checks for accuracy every so often.
E) Something awesome I haven't thought of yet :) How does SO keep track of soooo many stats??
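For reference, option A (with the periodic accuracy check mentioned under D) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module rather than SQL Server, and the table, column, and trigger names (post, comment, comment_count, comment_added, comment_removed) are made up for the example.

```python
import sqlite3

# Hypothetical schema: one parent table with a denormalized counter,
# one child table, and two triggers that keep the counter in step.
SCHEMA = """
CREATE TABLE post (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    comment_count INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE comment (
    id INTEGER PRIMARY KEY,
    post_id INTEGER NOT NULL REFERENCES post(id),
    body TEXT NOT NULL
);
CREATE TRIGGER comment_added AFTER INSERT ON comment
BEGIN
    UPDATE post SET comment_count = comment_count + 1 WHERE id = NEW.post_id;
END;
CREATE TRIGGER comment_removed AFTER DELETE ON comment
BEGIN
    UPDATE post SET comment_count = comment_count - 1 WHERE id = OLD.post_id;
END;
"""


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def recount(conn):
    """Accuracy check: recompute every counter from the child rows."""
    conn.execute(
        "UPDATE post SET comment_count = "
        "(SELECT COUNT(*) FROM comment WHERE comment.post_id = post.id)"
    )
```

The list page then reads comment_count straight from the parent row, and recount() can run on a schedule to repair any drift.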
Using SQL Server 2008 R2
**Please keep in mind that I am building something custom and it is not a blog/FB/SO. I am just using them as an example of a similar problem, so suggesting that I just use those sites is unhelpful ;) I would love to hear how this is accomplished in live web apps that handle a decent volume of requests.
THANKS in Advance
Description
I am developing an app which has Posts in it and for each post users can comment and like.
I am running a PG db on the server side and the data is structured in three different tables: post, post_comments (with ref to post), post_likes (with ref to post).
In the feed of the app I want to display all the posts with the comments count, last comment, number of likes and the last user that liked the post.
I was wondering what is the best approach to create the API calls, and currently have two ideas in mind:
First Idea
Make one large request using a query with multiple joins and parse the query accordingly.
The downside I see in this approach is that the query will be very heavy, which will affect the load time of the user's feed: it will have to run over post_comments, post_likes, etc., count all the rows, and then also retrieve the last rows.
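To make the first idea concrete, here is a rough sketch of what the single feed query might look like. It uses correlated subqueries rather than joining both child tables directly, because joining post_comments and post_likes to post in one pass multiplies the rows and inflates the counts. The example runs on Python's built-in sqlite3 instead of PG, and the column names (id, post_id, body, user_id) and the use of id as the "latest" ordering are assumptions made for illustration.

```python
import sqlite3

# One query per feed page: counts and "last" rows come from correlated
# subqueries so that comments and likes do not multiply each other.
FEED_SQL = """
SELECT p.id,
       (SELECT COUNT(*) FROM post_comments c WHERE c.post_id = p.id) AS comment_count,
       (SELECT c.body FROM post_comments c WHERE c.post_id = p.id
         ORDER BY c.id DESC LIMIT 1) AS last_comment,
       (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id) AS like_count,
       (SELECT l.user_id FROM post_likes l WHERE l.post_id = p.id
         ORDER BY l.id DESC LIMIT 1) AS last_liker
FROM post p
ORDER BY p.id DESC
LIMIT ?
"""


def open_db():
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE post (id INTEGER PRIMARY KEY);
        CREATE TABLE post_comments (
            id INTEGER PRIMARY KEY,
            post_id INTEGER NOT NULL REFERENCES post(id),
            body TEXT NOT NULL
        );
        CREATE TABLE post_likes (
            id INTEGER PRIMARY KEY,
            post_id INTEGER NOT NULL REFERENCES post(id),
            user_id INTEGER NOT NULL
        );
        -- The subqueries only stay cheap if the child tables are indexed by post.
        CREATE INDEX post_comments_by_post ON post_comments (post_id, id);
        CREATE INDEX post_likes_by_post ON post_likes (post_id, id);
    """)
    return conn


def fetch_feed(conn, limit=20):
    """Return (post id, comment count, last comment, like count, last liker)."""
    return conn.execute(FEED_SQL, (limit,)).fetchall()
```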
Second Idea
Add an extra table which I will call post_meta that will store those exact parameters I need and update them when needed.
This approach will make the retrieve query much lighter and faster (faster loading time), but will increase the adding & updating time of comments and likes.
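A minimal sketch of the second idea, assuming the application (rather than a trigger) keeps post_meta up to date in the same transaction as the comment insert. It again uses Python's sqlite3 as a stand-in for PG; the post_meta columns (comment_count, last_comment) and the add_comment helper are hypothetical names chosen for the example.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical post_meta table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE post (id INTEGER PRIMARY KEY);
        CREATE TABLE post_comments (
            id INTEGER PRIMARY KEY,
            post_id INTEGER NOT NULL REFERENCES post(id),
            body TEXT NOT NULL
        );
        CREATE TABLE post_meta (
            post_id INTEGER PRIMARY KEY REFERENCES post(id),
            comment_count INTEGER NOT NULL DEFAULT 0,
            last_comment TEXT
        );
    """)
    return conn


def add_comment(conn, post_id, body):
    """Insert a comment and update the summary row atomically."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "INSERT INTO post_comments (post_id, body) VALUES (?, ?)",
            (post_id, body),
        )
        # Make sure a summary row exists, then bump it.
        conn.execute("INSERT OR IGNORE INTO post_meta (post_id) VALUES (?)", (post_id,))
        conn.execute(
            "UPDATE post_meta SET comment_count = comment_count + 1, "
            "last_comment = ? WHERE post_id = ?",
            (body, post_id),
        )
```

The feed query then reads a single post_meta row per post; the cost is the extra write shown in add_comment.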
Was wondering if someone could give me some insights about the preferred way to tackle this problem.
Thanks
I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, plus custom attributes handled by one table that defines the custom attributes and another that holds their values, which can then be joined to the existing contact records for the user. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I can not rule out, is that a new database should be created for each user with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but you can offset that by writing efficient SQL queries. According to this site, you probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be good, provided that you write your queries efficiently.
If you recreate table sets, you will just end up creating lots of tables, which can make indexing slow and is bad practice. Instead, opt for a relational schema and normalize the database and its tables to improve efficiency.
Creating a new database for each and every user just compounds the complexity of both approaches above, resulting in shabby and disorganized database access. If you decide to run individual database instances for every single user, you will just end up consuming your server's physical resources, such as RAM and CPU, which will affect the service quality for all the other users.
Take up option 1. Assign separate userIds and assign them roles and privileges where needed. That is more efficient than the other two methods.
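As a rough illustration of option 1, here is a sketch of a shared table set with a user_id column, an index on that column, and the two custom-attribute tables the question describes. It uses Python's sqlite3 purely so the example runs; every table and column name (contact, custom_attribute, contact_attribute_value, label, value) is an assumption, not something from the original schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE contact (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    name TEXT NOT NULL
);
-- Queries always filter by user_id, so index it.
CREATE INDEX contact_by_user ON contact (user_id);
-- Which custom attributes each user has defined.
CREATE TABLE custom_attribute (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    label TEXT NOT NULL,
    UNIQUE (user_id, label)
);
-- The value of one attribute for one contact.
CREATE TABLE contact_attribute_value (
    contact_id INTEGER NOT NULL REFERENCES contact(id),
    attribute_id INTEGER NOT NULL REFERENCES custom_attribute(id),
    value TEXT,
    PRIMARY KEY (contact_id, attribute_id)
);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def contacts_with_attributes(conn, user_id):
    """Return {contact_id: {"name": ..., "attributes": {label: value}}} for one user."""
    rows = conn.execute(
        """
        SELECT c.id, c.name, a.label, v.value
        FROM contact c
        LEFT JOIN contact_attribute_value v ON v.contact_id = c.id
        LEFT JOIN custom_attribute a ON a.id = v.attribute_id
        WHERE c.user_id = ?
        ORDER BY c.id, a.label
        """,
        (user_id,),
    ).fetchall()
    result = {}
    for contact_id, name, label, value in rows:
        entry = result.setdefault(contact_id, {"name": name, "attributes": {}})
        if label is not None:  # contacts without custom values still appear
            entry["attributes"][label] = value
    return result
```

With the index in place, the WHERE user_id = ? filter touches only that user's rows rather than the whole table.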
I have a table on my database that outlines complex processes in a work breakdown structure (similar to what's used to create Gantt charts). There are multiple rows for a particular process, each row outlining a hierarchical step of a particular process.
I then have a table with some product types, each being linked to a particular process. When an order for a particular product is placed - it is to be manufactured with the associated process.
In my situation, the processes can be dynamic (steps added or removed, for example).
I'm curious as to what the best way to capture current and historical revisions of each process is, such that even though a process may have evolved over time - I can historically go back to a particular order and determine what the process looked like at that time.
I'm sure there are multiple ways to go about this, using logging or triggers with a new history table - but I've had no experience doing something like this and I'd like to know what worked well for others.
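One possible shape for this, sketched under several assumptions: every change to a process is stored as a new immutable revision, and each order records the revision that was current when it was placed. The example uses Python's sqlite3, flattens the hierarchy into an ordered list of steps for brevity, and all names (process_revision, process_step, customer_order, publish_revision, place_order, steps_for_order) are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE process_revision (
    id INTEGER PRIMARY KEY,
    process_id INTEGER NOT NULL,
    revision INTEGER NOT NULL,
    UNIQUE (process_id, revision)
);
CREATE TABLE process_step (
    revision_id INTEGER NOT NULL REFERENCES process_revision(id),
    position INTEGER NOT NULL,
    name TEXT NOT NULL,
    PRIMARY KEY (revision_id, position)
);
CREATE TABLE customer_order (
    id INTEGER PRIMARY KEY,
    process_revision_id INTEGER NOT NULL REFERENCES process_revision(id)
);
"""


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def publish_revision(conn, process_id, steps):
    """Store a full copy of the steps as the next revision of the process."""
    with conn:
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(revision), 0) FROM process_revision WHERE process_id = ?",
            (process_id,),
        ).fetchone()
        cur = conn.execute(
            "INSERT INTO process_revision (process_id, revision) VALUES (?, ?)",
            (process_id, latest + 1),
        )
        revision_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO process_step (revision_id, position, name) VALUES (?, ?, ?)",
            [(revision_id, i, name) for i, name in enumerate(steps, start=1)],
        )
    return revision_id


def place_order(conn, process_id):
    """Create an order pinned to the latest revision of the process."""
    with conn:
        (revision_id,) = conn.execute(
            "SELECT id FROM process_revision WHERE process_id = ? "
            "ORDER BY revision DESC LIMIT 1",
            (process_id,),
        ).fetchone()
        cur = conn.execute(
            "INSERT INTO customer_order (process_revision_id) VALUES (?)",
            (revision_id,),
        )
    return cur.lastrowid


def steps_for_order(conn, order_id):
    """Return the steps as they were when the order was placed."""
    rows = conn.execute(
        "SELECT s.name FROM customer_order o "
        "JOIN process_step s ON s.revision_id = o.process_revision_id "
        "WHERE o.id = ? ORDER BY s.position",
        (order_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

Because revisions are never edited in place, an old order keeps pointing at the steps it was manufactured with even after the process changes.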
On my website, there exists a group of 'power users' who are fantastic at adding lots of content to my site.
However, their prolific activity has led to their profile pages slowing down a lot. For 95% of the other users, the SPROC that returns the data is very quick. It's only for this group of power users that the very same SPROC is slow.
How does one go about optimising the query for this group of users?
You can assume that the right indexes have already been constructed.
EDIT: Ok, I think I have been a bit too vague. To rephrase the question, how can I optimise my site to enhance the performance for these 5% of users. Given that this SPROC is the same one that is in use for every user and that it is already well optimised, I am guessing the next steps are to explore caching possibilities on the data and application layers?
EDIT2: The only difference between my power users and the rest of the users is the amount of stuff they have added. So I guess the bottleneck is just the sheer number of records that is being fetched. An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
I think you summed it up here:
"An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them)."
Implement paging so that it only fetches 100 at a time or something?
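As a rough sketch of what fetching 100 at a time could look like in the data layer, here is one variant known as keyset pagination: each request asks for the next batch of items older than the last one already shown. It uses Python's sqlite3 so it can run standalone; the item table, its columns, and the fetch_items helper are assumptions for the example, and ordering by id stands in for "most recent first".

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            title TEXT NOT NULL
        );
        -- Lets each page be read directly from the index in display order.
        CREATE INDEX item_by_user ON item (user_id, id);
    """)
    return conn


def fetch_items(conn, user_id, before_id=None, page_size=100):
    """Return up to page_size items for a user, newest first.

    Pass the id of the last item already displayed as before_id
    to get the next batch while scrolling.
    """
    if before_id is None:
        sql = "SELECT id, title FROM item WHERE user_id = ? ORDER BY id DESC LIMIT ?"
        params = (user_id, page_size)
    else:
        sql = (
            "SELECT id, title FROM item WHERE user_id = ? AND id < ? "
            "ORDER BY id DESC LIMIT ?"
        )
        params = (user_id, before_id, page_size)
    return conn.execute(sql, params).fetchall()
```

A power user's profile then costs the same per request as anyone else's, since no request ever returns more than page_size rows.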
Well you can't optimize a query for a specific result set and leave the query for the rest unchanged. If you know what I mean. I'm guessing there's only one query to change, so you will optimize it for every type of user. Therefore this optimization scenario is no different from any other. Figure out what the problem is; is it too much data being returned? Calculations taking too long because of the amount of data? Where exactly is the cause of the slowdown? Those are questions you need to ask yourself.
However I see you talking about profile pages being slow. When you think the query that returns that information is already optimized (because it works for 95%), you might consider some form of caching of the profile page content. In general, profile pages do not have to supply real-time information.
Caching can be done in a lot of ways, far too many to cover in this answer. But to give you one small example; you could work with a temp table. Your 'profile query' returns information from that temp table, information that is already calculated. Because that query will be simple, it won't take that much time to execute. Meanwhile, you make sure that the temp table periodically gets refreshed.
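To illustrate the idea, here is a minimal sketch of a precomputed table that the profile query reads from and a refresh routine that would be run periodically (by a scheduled job, for instance). It uses an ordinary table in Python's sqlite3 rather than a SQL Server temp table, and the names profile_cache, item, refresh_profile_cache, and get_profile are all hypothetical.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            title TEXT NOT NULL
        );
        -- Precomputed per-user summary that the profile page reads.
        CREATE TABLE profile_cache (
            user_id INTEGER PRIMARY KEY,
            item_count INTEGER NOT NULL,
            latest_item_id INTEGER NOT NULL
        );
    """)
    return conn


def refresh_profile_cache(conn):
    """Rebuild the cache from the source table in one transaction."""
    with conn:
        conn.execute("DELETE FROM profile_cache")
        conn.execute(
            "INSERT INTO profile_cache (user_id, item_count, latest_item_id) "
            "SELECT user_id, COUNT(*), MAX(id) FROM item GROUP BY user_id"
        )


def get_profile(conn, user_id):
    """Cheap read: returns (item_count, latest_item_id), or None if not cached."""
    return conn.execute(
        "SELECT item_count, latest_item_id FROM profile_cache WHERE user_id = ?",
        (user_id,),
    ).fetchone()
```

Between refreshes the profile shows slightly stale numbers, which is the trade-off this approach accepts.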
Just a couple of ideas. I hope they're useful to you.
Edit:
"An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them)."
An obvious help for this will be to limit the number of results inside the query, or apply a form of pagination (in the DAL, not UI/BLL!).
You could limit the profile display so that it only shows the most recent 200 items. If your power users want to see more, they can click a button and get the rest of their items. At that point, they would expect a slower response.
Partition or separate the data for those users, so that the tables in question are used only by them.
In a clustered environment I believe SQL recognises this and spreads the load to compensate; however, in a single-server environment I'm not entirely sure how it does the optimisation.
So essentially (greatly simplified of course) ...
If you have a table called "Articles", have 2 tables ... "Articles", "Top5PercentArticles".
Because the data is now separated out into 2 smaller subsets, the indexes are smaller and the read and write requests on a single table in the database will drop.
It's not ideal from a business layer point of view, as you would then need some way to track what data is stored in which table, but that's a completely separate problem altogether.
Failing that your only option past execution plans is to scale up your server platform.
I am going to build a microblogging web service (for school, so don't blast me for the lack of a new idea), and I worry that the DB could often be overloaded. Users can follow other users or even tags, so I suppose the SELECT will be heavy: it has to fetch the 20 latest messages that match all the tags and users someone follows.
My idea is to create another table and store in it only statusID and userID (the user who should receive the message). The danger is that if some tag or user has many followers, there will be a lot of records with that status ID. So, is it a good idea? Or would it be better to use an M2M relation (one status -> many receivers)?
I think most databases can easily handle large record sets. The responsibility for making it perform lies in your design, in setting up the indexes properly. If you create the right indexes, the select clauses should perform really well.
I'd go with a users table, a messages table, and a table holding the m2m relationship between users and messages.
You can then do one select to find all of the users a user is following and then a second select to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
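The two selects could look roughly like this. It is only a sketch: it runs on Python's sqlite3, and the follow and message tables, their columns, and the timeline helper are assumptions about how the schema might be laid out, with id used as the recency ordering.

```python
import sqlite3


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE follow (
            follower_id INTEGER NOT NULL,
            followed_id INTEGER NOT NULL,
            PRIMARY KEY (follower_id, followed_id)
        );
        CREATE TABLE message (
            id INTEGER PRIMARY KEY,
            author_id INTEGER NOT NULL,
            body TEXT NOT NULL
        );
        -- Index the column the second select filters on.
        CREATE INDEX message_by_author ON message (author_id, id);
    """)
    return conn


def timeline(conn, user_id, limit=20):
    """Return the latest messages from everyone user_id follows, newest first."""
    # First select: who does this user follow?
    followed = [
        row[0]
        for row in conn.execute(
            "SELECT followed_id FROM follow WHERE follower_id = ?", (user_id,)
        )
    ]
    if not followed:
        return []
    # Second select: their messages, sorted and limited in the database.
    placeholders = ",".join("?" * len(followed))
    return conn.execute(
        f"SELECT id, author_id, body FROM message "
        f"WHERE author_id IN ({placeholders}) ORDER BY id DESC LIMIT ?",
        (*followed, limit),
    ).fetchall()
```

Followed tags could be handled the same way, with a second lookup table mapping messages to tags.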
This design should be fine for large numbers of users and messages as long as you index the right columns. If you got massive, you could also move the users and messages tables to different servers or add read-only replicas. I wouldn't even worry about that for the moment - you'd need to be huge.
When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.