Description
I am developing an app which has Posts in it and for each post users can comment and like.
I am running a PostgreSQL database on the server side, and the data is structured in three tables: post, post_comments (with a reference to post), and post_likes (with a reference to post).
In the feed of the app I want to display all the posts with the comments count, last comment, number of likes and the last user that liked the post.
I was wondering what is the best approach to create the API calls, and currently have two ideas in mind:
First Idea
Make one large request using a query with multiple joins and parse the query accordingly.
The downside I see in this approach is that the query will be very heavy, which will affect the load time of the user's feed, as it has to scan post_comments, post_likes, etc., count all the rows, and then also retrieve the last rows.
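A sketch of this first idea, using SQLite from Python as a stand-in for Postgres (the table and column names are assumptions based on the description above): correlated subqueries fetch each count and each "last" row in a single query.

```python
import sqlite3

# Hypothetical schema mirroring the question: post, post_comments, post_likes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE post_comments (id INTEGER PRIMARY KEY, post_id INTEGER, body TEXT);
CREATE TABLE post_likes (id INTEGER PRIMARY KEY, post_id INTEGER, user_name TEXT);
""")
conn.execute("INSERT INTO post VALUES (1, 'First post')")
conn.executemany("INSERT INTO post_comments (post_id, body) VALUES (?, ?)",
                 [(1, 'nice'), (1, 'great')])
conn.executemany("INSERT INTO post_likes (post_id, user_name) VALUES (?, ?)",
                 [(1, 'alice'), (1, 'bob')])

# One query that gathers the counts and the "last" rows via correlated subqueries.
row = conn.execute("""
SELECT p.id,
       (SELECT COUNT(*) FROM post_comments c WHERE c.post_id = p.id) AS comment_count,
       (SELECT body FROM post_comments c WHERE c.post_id = p.id
        ORDER BY c.id DESC LIMIT 1)                                  AS last_comment,
       (SELECT COUNT(*) FROM post_likes l WHERE l.post_id = p.id)    AS like_count,
       (SELECT user_name FROM post_likes l WHERE l.post_id = p.id
        ORDER BY l.id DESC LIMIT 1)                                  AS last_liker
FROM post p
""").fetchone()
print(row)  # (1, 2, 'great', 2, 'bob')
```

With indexes on post_comments(post_id) and post_likes(post_id), each subquery is an index lookup rather than a full scan.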
Second Idea
Add an extra table which I will call post_meta that will store those exact parameters I need and update them when needed.
This approach will make the read query much lighter and faster (faster loading time), but it will add overhead when inserting or updating comments and likes.
Was wondering if someone could give me some insights about the preferred way to tackle this problem.
Thanks
Related
I have run a query on Google BigQuery several hours ago, and the query is still running. I clicked "abandon", but it appears there is no way to stop a query. What can I do? Can I contact Google somehow, so they stop the query?
I've been working on a project for a company which analyzes Google Analytics data with BigQuery, so I don't want to run up a big bill for them or something.
(Maybe StackOverflow is not the right place to ask this question, but I've tried to find another place, and I couldn't. On the BigQuery support page, it is said that questions should be asked here, with the google-bigquery tag, so I'm doing that).
I've written a query (which I don't want to paste or describe here, as someone might abuse it to block BigQuery or something, I don't know). Let's just say it includes inner joins. After I've written it, and before running it, the console message was something like "This will analyze 674KB of data", which looked OK, given the fact that the table only has 10,000 rows. I've got the same message after clicking on "abandon" query, something like "You can abandon this, but you will still be billed for 674KB of data".
I try very hard to make sure what I do doesn't cause problems to someone, so I've actually run that query on a local PostgreSQL database (with the exact same data - 10,000 rows) as in BigQuery, and the query there finishes in a second or two.
How can I cancel this query, and can I (the company I've worked for) be billed for something more than 674KB of data?
For the time being, there is no way to stop a BigQuery job once it's started, neither via the web interface nor via API calls.
According to this, this feature may be added in the future.
As BigQuery shards the query across multiple machines, even a large query (terabyte-level) will not have a large impact on an individual machine, let alone a query over 674KB. However, according to this, that is the amount you will be charged.
Here are some tips to save money in BigQuery.
First thing to know is that, unlike traditional RDBMS, BigQuery is column based, and you will be charged by the amount of data in the columns rather than in the rows.
That means, don't include columns that you do not need in the query. This may sound trivial, but sometimes people coming from RDBMS may write queries like this:
SELECT
  COUNT(*), user_id
FROM
  [Dataset.Table]
GROUP BY
  user_id
The query is absolutely correct, but instead of being charged only the size of user_id column, Google would actually bill the whole table for this query. Therefore it's a good idea to explicitly specify the column names.
Break the tables into smaller chunks. Instead of having a single table that contains all the data, it's a good idea to split the table according to date, and use table wildcard functions to stitch the tables together during query. In this case, you won't be billed by rows that you don't need.
BigQuery supports canceling query jobs.
You can do this via the bq command line utility:
bq cancel <job_id>
or from the API via the jobs.cancel method (documented here)
Let's pretend I have a huge website and a huge table (with several million entries) with a few columns ("ID", "AuthorID", "Message", "Time" for example) to contain twitter-like messages.
I want to execute the following simple query:
SELECT * FROM HugeTable ORDER BY Time DESC LIMIT 1,10;
This query shall be executed a lot of times (tens per second). How do I make sure that this query is very fast?
I thought memcached could be a solution, but new posts are added very quickly, and with memcached I would serve "old" messages to users.
Assume that I have only one MySQL server, and that it is good enough to handle all the traffic.
My problem is that the server would have to take the whole table, sort it (huge bottleneck here), and then take only the first 10. So, what is the best optimization that I could do? Partitioning maybe? Also, inside the table, newer posts are put at the bottom, so it's safe to assume that a new post will have "ID" and "Time" >= the previous one.
Thanks in advance.
P.S.: I'm not an expert in MySQL (even though I know the basics), and I have no clue about NoSQL methods. If you believe NoSQL is the right fit for this task, then I'm open to learning something new :)
As you surmise, caching is the way to go. Either by creating a parallel table with the ten records you want in it (each time you do an insert, you remove the oldest one), or by doing the same thing further up the stack in memory. It's about how you manage what's in the cache.
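Before reaching for a cache, it's worth noting that the sort bottleneck described in the question usually disappears once "Time" is indexed: the engine walks the index backwards and stops after 10 rows, so no full-table sort happens. A minimal sketch in Python, with SQLite standing in for MySQL and the schema taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE HugeTable (
    ID INTEGER PRIMARY KEY, AuthorID INTEGER, Message TEXT, Time INTEGER)""")
conn.executemany("INSERT INTO HugeTable VALUES (?, ?, ?, ?)",
                 [(i, i % 7, f"msg {i}", i) for i in range(1, 1001)])
conn.execute("CREATE INDEX idx_time ON HugeTable (Time)")

# With the index, ORDER BY ... LIMIT is satisfied by an index scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM HugeTable ORDER BY Time DESC LIMIT 10"
).fetchall()
print(plan[0][-1])  # e.g. 'SCAN HugeTable USING INDEX idx_time'

newest = conn.execute(
    "SELECT ID FROM HugeTable ORDER BY Time DESC LIMIT 10").fetchall()
print([r[0] for r in newest])  # [1000, 999, ..., 991]
```

MySQL does the same with a B-tree index on Time (the EXPLAIN output shows no "Using filesort"); the cache-table approach from the answer above is still useful on top of this if the read rate is extreme.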
Not an answer to your question, but an answer to your problem: I wouldn't use a query, but would use a websocket solution to push posts to clients when they arrive. Connected clients would always receive the latest post when it arrives, and a websocket solution should have less overhead.
I am working on a web app and would like to be able to display some computed statistics about different objects to the user. Some good examples would be: The "Top Questions" page on this site - SO - which lists "Votes", "Answers", and "Views" or the "Comment" and "Like" counts for a list of posts on the Facebook News Feed. Actually computed values like these are used all over the web and I am not sure the best way to implement them.
I will describe a generic view of the problem in greater detail. I have a single parent table in my database (you can visualize it as a blog post). This table has a one-to-many relationship with 5 other tables (visualize them as comments, likes, etc.). I need to display a list of ~20 parent table objects with the counts of each related child object (visualize it as a list of blog posts, each displaying the total number of comments and the total number of likes). I can think of multiple ways to tackle the problem, but I am not sure which would be the FASTEST and most ACCURATE.
Here are a number of options I have come up with...
A) SQL Trigger - Create a trigger to increment and decrement a computed count column on the parent table as inserts and deletes are performed on the child tables. I'm not sure about the performance tradeoffs of running the trigger every time a small child object is created or deleted. I am also unsure about potential concurrency issues (although in my current architecture each row in the db can only be added or deleted by the row creator).
B) SQL View - Just an easier way to query and will yield accurate results, but I am worried about the performance implications for this type of view.
C) SQL Indexed View - An indexed view would be accurate and potentially faster, but as each child table has rows that can be added or removed, the view would constantly have to be recalculated.
D) Cached Changes - Some kind of interim in-process solution that would cache changes to the child tables, compute the net changes to the counts, and flush them to the db based on some parameter. This could potentially be coupled with a process that checks for accuracy every so often.
E) Something awesome I haven't thought of yet :) How does SO keep track of soooo many stats??
Using SQL Server 2008 R2.
Please keep in mind that I am building something custom and it is not a blog/FB/SO; I am just using them as examples of a similar problem, so suggesting to just use those sites is unhelpful ;) I would love to hear how this is accomplished in live web apps that handle a decent volume of requests.
THANKS in Advance
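Option A above (the trigger approach) can be sketched like this. SQLite syntax here stands in for SQL Server, and the table names are hypothetical; the idea carries over directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, comment_count INTEGER DEFAULT 0);
CREATE TABLE child_comments (id INTEGER PRIMARY KEY, parent_id INTEGER);

-- Keep the denormalized count in sync on insert and delete.
CREATE TRIGGER comments_ins AFTER INSERT ON child_comments
BEGIN
    UPDATE parent SET comment_count = comment_count + 1 WHERE id = NEW.parent_id;
END;
CREATE TRIGGER comments_del AFTER DELETE ON child_comments
BEGIN
    UPDATE parent SET comment_count = comment_count - 1 WHERE id = OLD.parent_id;
END;
""")
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.executemany("INSERT INTO child_comments (parent_id) VALUES (?)",
                 [(1,), (1,), (1,)])
conn.execute("DELETE FROM child_comments WHERE id = 1")

count = conn.execute("SELECT comment_count FROM parent WHERE id = 1").fetchone()[0]
print(count)  # 2
```

The feed query then reads comment_count straight off the parent row, at the cost of one extra UPDATE per child insert/delete.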
On my website, there exists a group of 'power users' who are fantastic at adding lots of content to my site.
However, their prolific activity has led to their profile pages slowing down a lot. For 95% of the other users, the SPROC that returns the data is very quick. It's only for this group of power users that the very same SPROC is slow.
How does one go about optimising the query for this group of users?
You can assume that the right indexes have already been constructed.
EDIT: Ok, I think I have been a bit too vague. To rephrase the question, how can I optimise my site to enhance the performance for these 5% of users. Given that this SPROC is the same one that is in use for every user and that it is already well optimised, I am guessing the next steps are to explore caching possibilities on the data and application layers?
EDIT2: The only difference between my power users and the rest of the users is the amount of stuff they have added. So I guess the bottleneck is just the sheer number of records that is being fetched. An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
I think you summed it up here:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
Implement paging so that it only fetches 100 at a time or something?
Well, you can't optimize a query for a specific result set and leave the query unchanged for everyone else. I'm guessing there's only one query to change, so you will be optimizing it for every type of user. Therefore this optimization scenario is no different from any other. Figure out what the problem is: is it too much data being returned? Calculations taking too long because of the amount of data? Where exactly is the cause of the slowdown? Those are questions you need to ask yourself.
However I see you talking about profile pages being slow. When you think the query that returns that information is already optimized (because it works for 95%), you might consider some form of caching of the profile page content. In general, profile pages do not have to supply real-time information.
Caching can be done in a lot of ways, far too many to cover in this answer. But to give you one small example; you could work with a temp table. Your 'profile query' returns information from that temp table, information that is already calculated. Because that query will be simple, it won't take that much time to execute. Meanwhile, you make sure that the temp table periodically gets refreshed.
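The summary-table idea above can be sketched like this (SQLite from Python as a stand-in; table and column names are hypothetical): a periodic job recomputes the counts, and the profile query just reads the precomputed row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER);
CREATE TABLE profile_summary (user_id INTEGER PRIMARY KEY, item_count INTEGER);
""")
conn.executemany("INSERT INTO items (user_id) VALUES (?)",
                 [(1,)] * 3 + [(2,)] * 5)

def refresh_summaries():
    # Periodic job: recompute the counts and overwrite the summary rows.
    conn.execute("""
        INSERT OR REPLACE INTO profile_summary (user_id, item_count)
        SELECT user_id, COUNT(*) FROM items GROUP BY user_id
    """)

refresh_summaries()
# The profile page now reads one precomputed row instead of counting live.
count = conn.execute(
    "SELECT item_count FROM profile_summary WHERE user_id = 2").fetchone()[0]
print(count)  # 5
```

The tradeoff is staleness: the profile shows numbers as of the last refresh, which is usually acceptable for profile pages.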
Just a couple of ideas. I hope they're useful to you.
Edit:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
An obvious help for this will be to limit the number of results inside the query, or apply a form of pagination (in the DAL, not UI/BLL!).
You could limit the profile display so that it only shows the most recent 200 items. If your power users want to see more, they can click a button and get the rest of their items. At that point, they would expect a slower response.
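Pagination in the DAL can be done with keyset (seek) pagination rather than OFFSET, so the cost of fetching page N doesn't grow with N even for the 10,000-item power users. A sketch with SQLite standing in for SQL Server (names hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.executemany("INSERT INTO items (id, user_id) VALUES (?, 1)",
                 [(i,) for i in range(1, 251)])

PAGE_SIZE = 100

def fetch_page(last_seen_id=None):
    # Keyset pagination: filter on the previous page's last id instead of
    # using OFFSET, so each page is an index seek plus PAGE_SIZE rows.
    if last_seen_id is None:
        rows = conn.execute(
            "SELECT id FROM items WHERE user_id = 1 ORDER BY id DESC LIMIT ?",
            (PAGE_SIZE,)).fetchall()
    else:
        rows = conn.execute(
            "SELECT id FROM items WHERE user_id = 1 AND id < ? "
            "ORDER BY id DESC LIMIT ?", (last_seen_id, PAGE_SIZE)).fetchall()
    return [r[0] for r in rows]

page1 = fetch_page()
page2 = fetch_page(last_seen_id=page1[-1])
print(page1[0], page1[-1], page2[0])  # 250 151 150
```

The UI's "load more" button just passes the last id it rendered back to the DAL.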
Partition / separate the data for those users then the tables in question will be used by only them.
In a clustered environment I believe SQL Server recognises this and spreads the load to compensate; however, in a single-server environment I'm not entirely sure how it does the optimisation.
So essentially (greatly simplified of course) ...
If you have a table called "Articles", have 2 tables... "Articles" and "Top5PercentArticles".
Because the data is now separated out in to 2 smaller subsets of data the indexes are smaller and the read and write requests on a single table in the database will drop.
It's not ideal from a business-layer point of view, as you would then need some way to track which data is stored in which table, but that's a completely separate problem altogether.
Failing that, your only option beyond tuning execution plans is to scale up your server platform.
I'm writing an application that displays posts, much like a simple forum system.
Posts can be marked as urgent by users when they submit the post, but a moderator must approve the "urgent" status. A post will still be displayed even if it has not been approved as urgent, but will just appear to be a normal post until the moderator approves the urgent status at which point the post will get special treatment.
I have considered two approaches for this:
1) Have two flags in the posts table: one to say the user has requested urgent status, and a second to indicate whether an admin has approved urgent status. Only if both are true will the post be shown as urgent.
2) have two tables. A requests pending table which holds all the pending urgent approvals. Once admin approves urgent status, I would delete the request from the pending table and update the posts table so that the urgent field becomes true for that post.
I'm not sure if either approach is better than the other.
The first solution means I only have one table to worry about, but it ends up with more fields in it. Not sure if this actually makes querying any slower or not, considering that the posts table will be the most queried table in the app.
The second solution keeps the posts table leaner but adds another table to deal with (not that this is hard).
I'm leaning towards the second solution, but wonder if I'm not over-analysing things and making my life more complicated than it needs to be. Advice?
Definitely 1). The additional table just messes things up. One extra status field is enough, with values: 0=normal, 1=urgent_requested, 2=urgent_approved for example.
You could query with status=1 for queries needing approval, and if you order by status desc, you naturally get the urgent messages up front.
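The single status field can be sketched like this (SQLite from Python; table names hypothetical), covering both the moderation queue and the feed ordering mentioned above:

```python
import sqlite3

# Status encoding from the answer: 0 = normal, 1 = urgent_requested, 2 = urgent_approved.
NORMAL, URGENT_REQUESTED, URGENT_APPROVED = 0, 1, 2

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, status INTEGER NOT NULL DEFAULT 0)")
conn.executemany("INSERT INTO posts VALUES (?, ?)",
                 [(1, NORMAL), (2, URGENT_APPROVED),
                  (3, URGENT_REQUESTED), (4, NORMAL)])

# Moderation queue: everything awaiting approval.
pending = [r[0] for r in conn.execute(
    "SELECT id FROM posts WHERE status = ?", (URGENT_REQUESTED,))]
print(pending)  # [3]

# Feed ordered by status desc: approved-urgent posts float to the top.
feed = [r[0] for r in conn.execute(
    "SELECT id FROM posts ORDER BY status DESC, id DESC")]
print(feed)  # [2, 3, 4, 1]
```

A single index on status keeps both queries cheap; approving a post is one UPDATE rather than a delete-plus-update across two tables.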
There's another solution that comes in mind :)
You can have a table with post statuses, and in your Posts table you would have a column which references a status.
This approach has several advantages - for example, you can seamlessly add more statuses in the future, or you can even have another table holding rules for how statuses can be changed (workflow).
The second approach is cleanest in terms of design, and it will probably end up using less disk space. The first approach is less "pure", but from a maintenance and coding point of view it's simpler; hence I'd go with this approach.
Also, it's great to see someone thinking about design before they go off and write reams of code. :) Can't tell you how many messed-up projects I've seen, where a single hour of thinking about the design would've saved many hours of effort for all involved...
I think option 1 is the best. The only thing you need to do is make an index with the two fields.
Option 2 adds too much complexity.
I have a MySQL query just for things like this; I will post it as soon as I remember/find the correct syntax.