I am building a site that allows users to view articles and perform some activities on them (vote, comment, ...). I am using MySQL as the main storage. To improve performance, I am considering using Redis (4.x) to handle some view-related features such as top/hot articles.
I am going to use one sorted set, called topArticleSortedSet, to store the top articles, and this set will be updated frequently, every time a user votes or comments on an article.
Each user logs in and follows some topics, so I also need to filter the articles in topArticleSortedSet and display them based on the topics each user follows.
There is of course scroll paging as well.
For those reasons, I intend to create one topArticleSortedSet per user, so that each user has an independent list. But I don't know whether this is good practice, because there might be millions of logged-in users accessing my site (which would mean millions of sets, each holding around 1,000 article items).
Can anyone give me some advice please?
I think you should keep a single set and filter it for each user, instead of having one set per user. Here is why:
My understanding is that the set has to be updated each time someone reads an article (probably by incrementing a counter).
Let's say you have n users, each one reading p articles per day. So you have to update the set n*p times a day.
In the single-set option, you only need to update one set when an article is read, for a total of n*p updates. In the one-set-per-user architecture, each read has to be propagated to every one of the n per-user sets, which gives n*p*n updates, a much bigger number.
Of course, filtering a single set takes some time, longer than reading a set built for one user. But on average, I would guess it takes much less time than n extra update operations. Basically, you need to know which is faster: filtering one set or updating n sets?
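To make the trade-off concrete, here is a minimal sketch of the single-set approach, assuming redis-py 3.x: every vote/comment touches exactly one global sorted set, and the per-user topic filtering happens at read time with SUNIONSTORE + ZINTERSTORE. The key names (topArticleSortedSet, topic:<name>:articles, user:<id>:feed) are illustrative.

```python
# Minimal sketch of the single-set approach, assuming redis-py >= 3.x.
# Key names are illustrative, not taken from the question.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def record_activity(article_id, topic, weight=1):
    """Called on every vote/comment: bump the article's score in the one global set."""
    r.zincrby("topArticleSortedSet", weight, article_id)
    # Keep a plain set of article ids per topic so we can filter later.
    r.sadd(f"topic:{topic}:articles", article_id)

def top_articles_for_user(user_id, topics, page=0, page_size=20):
    """Filter the global set down to the user's followed topics, then page through it."""
    union_key = f"user:{user_id}:topic_articles"
    filtered_key = f"user:{user_id}:feed"

    # All article ids belonging to any of the user's followed topics.
    r.sunionstore(union_key, [f"topic:{t}:articles" for t in topics])
    # Intersect with the global ranking; weight 0 on the plain set keeps the global scores.
    r.zinterstore(filtered_key, {"topArticleSortedSet": 1, union_key: 0})
    r.expire(union_key, 60)      # short-lived scratch keys, recomputed on the next read
    r.expire(filtered_key, 60)

    start = page * page_size
    return r.zrevrange(filtered_key, start, start + page_size - 1, withscores=True)
```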
I am facing this problem when calculating the inbox for a user:
On one hand, I have a bunch of documents that can potentially have many readers (the DOCS table).
Each reader belongs to one or more defined groups of users.
I have a table DOC_ACCESS_BY_GROUP with (DOC_ID, GROUP_ID).
I need to know if a user has read a document or not. So, I have a table DOC_UNREAD with (DOC_ID, USER_ID) so that if a document is in that table, the user has not read the document yet.
Each group's membership can also change at any time, so I need to calculate the "inbox" for a given user in real time.
The first guess is: calculate all the groups the user belongs to, then join DOCS with DOC_ACCESS_BY_GROUP to get all the documents for that user (with the associated data), and then do another join to see whether each of those documents has been read by the user or not.
The problem is, when my DOCS table grows considerably and I have many users, and many groups... the performance is really poor.
I'm trying to abstract the problem, which is actually a bit more complex. Storing document permissions per user has been ruled out. I also suspect this is not a problem that can be solved by optimizing the SQL query alone; it should be handled in software. We also support several databases, such as MySQL, PostgreSQL and MSSQL, so the solution cannot be tied to a specific vendor (I guess).
So, the question is: Does anyone know any mechanism or framework or algorithm to do things differently and solve this problem, in an optimal and performant way?
Memcached? Infinispan? Hadoop?
You probably want to "materialize" the inbox and update it every time the user reads something, the membership of a group changes etc. The materialized inbox could be stored either in a DB table or in a separate system like Infinispan/memcached.
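As a rough illustration of what "materializing" the inbox could look like, here is a sketch using Redis sets (the same idea works with a plain DB table or memcached/Infinispan). The key names and the members_of()/already_read helpers are assumptions standing in for your own group and read-tracking tables.

```python
# Rough sketch of a materialized inbox kept in Redis sets.
import redis

r = redis.Redis(decode_responses=True)

def publish_document(doc_id, group_ids, members_of):
    """When a document is published to some groups, push it into each member's inbox."""
    for group_id in group_ids:
        r.sadd(f"group:{group_id}:docs", doc_id)
        for user_id in members_of(group_id):   # members_of() is assumed to query your DB
            r.sadd(f"user:{user_id}:unread", doc_id)

def mark_read(user_id, doc_id):
    r.srem(f"user:{user_id}:unread", doc_id)

def user_joins_group(user_id, group_id, already_read):
    """On a membership change, backfill the group's documents the user has not read yet."""
    for doc_id in r.smembers(f"group:{group_id}:docs"):
        if doc_id not in already_read:         # already_read comes from your read-tracking table
            r.sadd(f"user:{user_id}:unread", doc_id)

def inbox(user_id):
    """Reading the inbox is now a single set read instead of a multi-table join."""
    return r.smembers(f"user:{user_id}:unread")
```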
There are many accounts, which receive events (data points with timestamps) stored in real time. I discovered that it is a good idea to store events in a sorted set. I tried storing events for multiple accounts in one sorted set, but then I couldn't figure out how to filter events by account id.
Is it a good idea to create multiple sorted sets for each account (> 1000 accounts)?
Questions:
How long will you keep these events in memory?
Won't your number of accounts grow?
Are you sure you will have enough memory?
... but yes, you should definitely create a sorted set for each account; that's the state of the art when using Redis.
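For example, a minimal per-account sorted set keyed by timestamp might look like this (redis-py assumed; the events:<account_id> key name is made up for the example):

```python
# Minimal sketch of one sorted set per account, scored by event timestamp.
import time
import redis

r = redis.Redis(decode_responses=True)

def store_event(account_id, event_payload, ts=None):
    ts = time.time() if ts is None else ts
    # Score = timestamp, so time-range queries are O(log N + M).
    r.zadd(f"events:{account_id}", {event_payload: ts})

def events_between(account_id, start_ts, end_ts):
    return r.zrangebyscore(f"events:{account_id}", start_ts, end_ts, withscores=True)

def trim_older_than(account_id, cutoff_ts):
    # Answers "how long will you keep these events in memory?" with an explicit cutoff.
    r.zremrangebyscore(f"events:{account_id}", "-inf", cutoff_ts)
```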
However, if it's all about real-time events (storing and retrieval) you may want to give a try to a database like InfluxDB that provides a powerful SQL-like query system. It seems a better answer to your problem.
I'm implementing a Leaderboard into my django web app and don't know the best way to do it. Currently, I'm just using SQL to order my users and, from that, make a Leaderboard, however, this creates two main problems:
Performance is shocking. I've only tried scaling it to a few hundred users but I can tell calculating ranking is slow and excessive caching is annoying since I need users to see their ranking after they are added to the Leaderboard.
It's near-impossible to tell a user what position they are without performing the whole Leaderboard calculation again.
I haven't deployed yet, but I estimate about 5% of operations will be updates to the leaderboard vs 95% reads of it (probably even more reads, actually). So my latest idea is to recalculate the leaderboard each time a user is added, storing a position field I can easily sort by, so there is no need to recalculate just to display a user's ranking.
However, could this be a problem if multiple users are committing at the same time? Will locking be enough, or will rankings get messed up? Additionally, I plan to put this on a separate database used solely for these leaderboards; which one is best? I hear good things about Redis...
Any better ways to solve this problem? (anyone know how SO makes their leaderboards?)
I've written a number of leaderboards libraries that would help you out there. The one that would be of immediate use is python-leaderboard, which is based on the reference implementation leaderboard ruby gem. Using Redis sorted sets, your leaderboard will be ranked in real-time and there is a specific section on the leaderboard page with respect to performance metrics for inserting a large number of members in a leaderboard at once. You can expect to rank 1 million members in around 30 seconds if you're pipelining writes.
If you're worried about the data changing too often in real-time, you could operate Redis in a master-slave configuration and have the leaderboards pull data from the slave, which would only poll periodically from the master.
Hope this helps!
You will appreciate the concept of sorted sets in Redis.
Don't miss the paragraph which describes your problem :D
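For reference, here is a minimal sketch of that idea using raw redis-py sorted-set commands; the python-leaderboard library mentioned above wraps the same primitives, and the "leaderboard" key name is arbitrary.

```python
# Minimal leaderboard sketch with raw redis-py sorted-set commands.
import redis

r = redis.Redis(decode_responses=True)

def set_score(user_id, score):
    r.zadd("leaderboard", {user_id: score})

def rank_of(user_id):
    # ZREVRANK is 0-based and O(log N), so showing a user their own position
    # no longer requires recomputing the whole board.
    rank = r.zrevrank("leaderboard", user_id)
    return None if rank is None else rank + 1

def leaderboard_page(page=0, page_size=25):
    start = page * page_size
    return r.zrevrange("leaderboard", start, start + page_size - 1, withscores=True)
```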
Make a table that stores user id and user score. Just pull the leaderboard using
ORDER BY user_score DESC
and join the Main table for the User name or whatever else you need.
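A sketch of that query, run through any DB-API cursor; the table and column names (user_scores, users, user_score) are assumptions based on the description above.

```python
# Hypothetical leaderboard query; adjust table/column names to your schema.
LEADERBOARD_SQL = """
    SELECT u.user_id,
           u.user_name,
           s.user_score
    FROM   user_scores AS s
           JOIN users AS u ON u.user_id = s.user_id
    ORDER  BY s.user_score DESC
    LIMIT  %s OFFSET %s
"""

def leaderboard_page(cursor, page=0, page_size=50):
    cursor.execute(LEADERBOARD_SQL, (page_size, page * page_size))
    return cursor.fetchall()
```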
Unless the total number of tests is a variable in your equation, the calculation from your ranking system should stay the same for each user, so just update the individual entries.
On my website, there exists a group of 'power users' who are fantastic and adding lots of content on to my site.
However, their prolific activity has led to their profile pages slowing down a lot. For 95% of the other users, the SPROC that returns the data is very quick. It's only for this group of power users that the very same SPROC is slow.
How does one go about optimising the query for this group of users?
You can assume that the right indexes have already been constructed.
EDIT: OK, I think I have been a bit too vague. To rephrase the question: how can I optimise my site to improve performance for these 5% of users? Given that this SPROC is the same one used for every user and that it is already well optimised, I am guessing the next steps are to explore caching possibilities at the data and application layers?
EDIT2: The only difference between my power users and the rest of the users is the amount of stuff they have added. So I guess the bottleneck is just the sheer number of records that is being fetched. An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
I think you summed it up here:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
Implement paging so that it only fetches 100 at a time or something?
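One possible shape for that paging is keyset ("seek") pagination, which stays fast even for the 10,000-item power users; the table and column names are assumptions, and the MySQL-style LIMIT would become TOP / OFFSET ... FETCH on SQL Server.

```python
# Hypothetical keyset-pagination query for the profile items.
PROFILE_PAGE_SQL = """
    SELECT item_id, title, created_at
    FROM   user_items
    WHERE  user_id = %s
      AND  (%s IS NULL OR created_at < %s)
    ORDER  BY created_at DESC
    LIMIT  100
"""

def profile_page(cursor, user_id, last_seen_created_at=None):
    # Pass the created_at of the last item already shown to fetch the next 100.
    cursor.execute(PROFILE_PAGE_SQL, (user_id, last_seen_created_at, last_seen_created_at))
    return cursor.fetchall()
```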
Well, you can't optimize a query for one specific result set and leave it unchanged for everyone else, if you know what I mean. I'm guessing there's only one query to change, so you will be optimizing it for every type of user. Therefore this optimization scenario is no different from any other. Figure out what the problem is: is too much data being returned? Are calculations taking too long because of the amount of data? Where exactly is the cause of the slowdown? Those are the questions you need to ask yourself.
However, I see you mention that profile pages are slow. Since you think the query that returns that information is already optimized (because it works fine for 95% of users), you might consider some form of caching of the profile page content. In general, profile pages do not have to supply real-time information.
Caching can be done in a lot of ways, far too many to cover in this answer. But to give you one small example; you could work with a temp table. Your 'profile query' returns information from that temp table, information that is already calculated. Because that query will be simple, it won't take that much time to execute. Meanwhile, you make sure that the temp table periodically gets refreshed.
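To illustrate the precomputed-table idea: a scheduled job rebuilds a summary table and the profile page reads only from it. The table names are made up, and the MySQL-flavoured REPLACE INTO would become a MERGE/upsert on other engines.

```python
# Hypothetical summary table refreshed periodically; the profile page reads only from it.
REFRESH_SQL = """
    REPLACE INTO profile_summary (user_id, item_count, last_item_at)
    SELECT user_id, COUNT(*), MAX(created_at)
    FROM   user_items
    GROUP  BY user_id
"""

PROFILE_READ_SQL = "SELECT item_count, last_item_at FROM profile_summary WHERE user_id = %s"

def refresh_summaries(cursor):
    # Run from a cron job / scheduled task every few minutes.
    cursor.execute(REFRESH_SQL)

def read_profile_summary(cursor, user_id):
    cursor.execute(PROFILE_READ_SQL, (user_id,))
    return cursor.fetchone()
```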
Just a couple of ideas. I hope they're useful to you.
Edit:
An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
An obvious help for this will be to limit the number of results inside the query, or apply a form of pagination (in the DAL, not UI/BLL!).
You could limit the profile display so that it only shows the most recent 200 items. If your power users want to see more, they can click a button and get the rest of their items. At that point, they would expect a slower response.
Partition / separate the data for those users; then the tables in question will be used only by them.
In a clustered environment I believe SQL Server recognises this and spreads the load to compensate; however, in a single-server environment I'm not entirely sure how it handles the optimisation.
So essentially (greatly simplified of course) ...
If you have a table called "Articles", have two tables: "Articles" and "Top5PercentArticles".
Because the data is now separated into two smaller subsets, the indexes are smaller and the read and write load on any single table in the database will drop.
It's not ideal from a business-layer point of view, as you would then need some way to track which data is stored in which table, but that's a completely separate problem altogether.
Failing that, your only remaining option beyond tuning execution plans is to scale up your server platform.
This is not an SO Meta question; I am using SO only as an example.
On Stack Overflow, each answer, each comment, each question and each vote has an effect that may produce a badge at some point in time. I mean that after every action, a list of queries has to be evaluated.
E.g. if Mr. A upvotes Mr. B's answer, we have to check: has Mr. B's answer now been upvoted 100 times? If so, give Mr. B a badge. Has Mr. A cast his 100th upvote? If so, give him a badge.
This means I have to run at least 100 queries / if-else checks for each action.
My real-life example is an application where I receive online data from an attendance machine. When a user shows his card to the machine, I receive the event and store it as a record. Based on this record I have multiple calculations: is he late? Has he been late for 3 days in a row? Is he on the right shift (day shift / night shift)? Is today a holiday? Is this overtime? Is he early? Etc., etc., etc.
What is the best strategy for this kind of requirements.
Update:
Can the SO team guide us on this?
You use queues and workflows. This way you decouple the moment of the update from the actual notifications, allowing the system to scale. Tightly coupled, trigger-based or similar solutions cannot scale, because each update has to wait for all the interested parties to react to the notification. Designing the processing engine around workflows lets you easily add steps and notification consumers by changing data, without changing the schema.
For instance see how MSDN uses queues to handle similar problems with MSDN content: Building the MSDN Aggregation System.
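As a toy illustration of that decoupling: the request path only enqueues the event, and a separate worker runs the badge/attendance rules out of band. A Redis list is used as the queue purely for the example (any queue or broker works), and the rule registry is a made-up structure.

```python
# Request path enqueues; a worker consumes and evaluates the rules asynchronously.
import json
import redis

r = redis.Redis(decode_responses=True)
QUEUE = "events:pending"

def enqueue_event(event_type, payload):
    # Called from the request path; O(1), so the vote/upvote/check-in stays fast.
    r.lpush(QUEUE, json.dumps({"type": event_type, "payload": payload}))

def run_worker(rules):
    """rules: {event_type: [callable, ...]}; add new badge checks by adding entries."""
    while True:
        _, raw = r.brpop(QUEUE)           # blocks until an event arrives
        event = json.loads(raw)
        for rule in rules.get(event["type"], []):
            rule(event["payload"])        # each rule decides whether to award a badge
```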
Couldn't you just use "flags" (other tables, other columns, whatever) to indicate when those special cases occur? That way you would only have to do one lookup per special case, rather than a ton of lookups and/or joins. You could record the changes (third day late, etc.) at insert time.
Also, what to check depends on a threshold.
e.g. has a person been absent for the last 3 days? That check is only required once the person has already been absent for 2 days.
I mean, you need not check everything every time.
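A sketch combining the two suggestions above (record the flag at insert time, and gate each rule behind a cheap threshold); the store helper and its methods are hypothetical.

```python
# Hypothetical attendance handler: cheap counter kept on insert, expensive rule gated by it.
def on_attendance_record(user_id, is_late, store):
    if not is_late:
        store.set_counter(user_id, "consecutive_late_days", 0)
        return

    # Recorded at insert time, so no scan over history is needed later.
    late_days = store.increment(user_id, "consecutive_late_days")

    # Threshold gate: the "late 3 days in a row" rule is only worth checking at exactly 3.
    if late_days == 3:
        store.set_flag(user_id, "late_three_days_in_a_row")
```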
Also, how much of the info needs to be updated immediately? SO doesn't update things in real time.
Maybe you could use two databases with online replication between them: one for ingesting real-time data and nothing else, and a second one for the heavy calculations (for example, recalculating all the lateness checks every 10 minutes or on request). Locate these databases on different servers.