I'm implementing a Leaderboard into my django web app and don't know the best way to do it. Currently, I'm just using SQL to order my users and, from that, make a Leaderboard, however, this creates two main problems:
Performance is shocking. I've only tried scaling it to a few hundred users but I can tell calculating ranking is slow and excessive caching is annoying since I need users to see their ranking after they are added to the Leaderboard.
It's near-impossible to tell a user what position they are without performing the whole Leaderboard calculation again.
I haven't deployed but I estimate about 5% updates to Leaderboard vs 95% reading (probably more, actually) the Leaderboard. So my latest idea is to calculate a Leaderboard again each time a user is added, with a position field I can easily sort by, and no need to re-calculate to display a user's ranking.
However, could this be a problem if multiple users are committing at the same time, will locking be enough or will rankings stuff up? Additionally, I plan to put this on a separate database solely for these leaderboards, which is the best? I hear good things about redis...
Any better ways to solve this problem? (anyone know how SO makes their leaderboards?)
I've written a number of leaderboards libraries that would help you out there. The one that would be of immediate use is python-leaderboard, which is based on the reference implementation leaderboard ruby gem. Using Redis sorted sets, your leaderboard will be ranked in real-time and there is a specific section on the leaderboard page with respect to performance metrics for inserting a large number of members in a leaderboard at once. You can expect to rank 1 million members in around 30 seconds if you're pipelining writes.
If you're worried about the data changing too often in real-time, you could operate Redis in a master-slave configuration and have the leaderboards pull data from the slave, which would only poll periodically from the master.
Hope this helps!
You will appreciate the concept of sorted sets in Redis.
Don't miss the paragraph which describes your problem :D
Make a table that stores user id and user score. Just pull the leader board using
ORDER BY user_score DESC
and join the Main table for the User name or whatever else you need.
Unless the total number of tests is a variable in your equation, the calculation from your ranking system should stay the same for each user so just update individual entries.
Related
Let's say we want to design a gaming platform for chess and our userbase is ~50M. And whenever a player comes to our platform to play chess we assign a random player with almost the same rating as an opponent. Every player has 4 level rating [Easy, Medium, Hard, Expert].
My approach was to keep all user in Redis cache(assume all users/boats are live and waiting for the opponent) so we are keeping data in the below format:
"chess:easy" : [u1, u2, u3]
"chess:medium" : [u4, u5, u6]
so when a user comes I will remove a user from the cache and assign.
For example: u7 (easy) wants to play chess games then his opponent will be u1(easy).
But won't this create a problem for concurrent requests as we read then remove from Redis List that will be blocking?
Can anyone suggest a better approach with or without cache?
But won't this create a problem for concurrent requests as we read then remove from Redis List that will be blocking?
No, because Redis is single-threaded when it performs write but, assuming the 50M users are equality distributed across the chess level, you would get 4 lists, one per difficulty, with a length of 12,5 million
Manipulating the list could have a lot of complexity because using [LPOP][1] to select a user and remove it has O(N) complexity and the last user may wait a lot of time before getting an opponent.
I think you should aim to use a HASH data structure and splitting the users in different databases:
db0: Easy
db1: Medium
db2: Hard
db3: Expert
In this way, you can store the users that want to play with HSET <USERID> status "ready to play" and then benefit the RANDOMKEY command to select an opponent and delete it with HDEL <KEY RETURNED BY RANDOM> status.
Doing so, you will execute commands O(1) only, providing a fast and reliable matching system that you can optimize further with the redis pipeling feature.
NB: if you set and hash per difficulty level and adding multiple fields to the hash, you will hit the O(N) complexity due to the HDEL command!
[1]: https://redis.io/commands/lpop
Can anyone suggest a better approach with or without cache?
There are many algorithms you may use such as:
Stable marriage problem
Exact cover
but it depends on the user experience you want to implement for your clients.
You can explore the gamedev.stackexchange.com community
I'm refactoring my current schema and it's too abstract for me.
I monitor my servers with a homemade monitoring software. This software sends HTTP requests to a Rails web server with about ten different fields worth of information so I can get a quick overview of everything.
My current implementation:
server [id, name, created_date, edited_date, ..., etc ]
status_update [id, server_id, field1, field2, field3, created_date, edited_date, ..., etc]
I treat the servers as Users and status updates as Tweets. I delete any status_update on a server_id older than the tenth one just to keep from growing to infinity.
Though I'm starting to run into a few complications. I need to display information from the most recent status_update on the index page, I need to sort the servers based on status_update info, I need to store info from certain status_updates that may be way older than 10 status_updates old. It also seems like I'm going to start needing to store information from status_updates in both the server and status_update, which would cause hitting the DB multiple times on an insert. Thus, I am looking to refactor.
My requirements:
I only need to display information from the most recent update.
Having the next 9 status_updates helps debug if the system goes offline.
I need to be able to sort based on some info from most recent status_update.
I need the database to remain small (Heroku free).
Ideal performance, IE not hitting database more than once unless necessary.
Non-Complicated DB structure so I can pass it along.
Edit: Additional Info => I am looking to ultimately monitor about 150-200 servers (a lot for a hobby dev, but I'm cheap). Each monitoring service posts every five minutes or so unless something goes wrong. So, worst case scenario has me reaching max capacity every four hours.
I was thinking it would be nice to track when the last time X event happened, and what the result was. Thus, tracking that information would have to be moved to the server model itself since I'm wiping out old records and would lose the information after an hour or so. Though in retrospect, I could just save that info in memory using the monitoring service and send it up every five minutes or only once each time it changes. I could also simply edit that information only when it changes, so as to process less information on each request. Hm!
Efficiency
All ORMs, including ActiveRecord, are designed and built around certain tradeoffs. It's commonplace for ORMs to use several simple SELECT statements to do what a SQL developer would do with a single SELECT statement. You're probably not overwhelming Heroku with your queries.
There's no reasonable structural solution to this problem.
Size
Your "status_update" table should be able to hold an enormous number of rows. Heroku's hobby-dev plan allows 10,000 rows. How many servers do you seriously expect to monitor on a free plan? If I were you, I would delete old rows from it no more than once a day, or when I got a permission error. (On Heroku, certain permission errors mean you're over the row limit.)
It also seems like I'm going to start needing to store information
from status_updates in both the server and status_update, which would
cause hitting the DB multiple times on an insert.
This really makes little sense. Tweets don't require updates to the user account; status updates don't require updates about the server. This might suggest refactoring is in order, but I'd want to see either your models or your CREATE TABLE statements to be sure. (You can paste those into your question, and leave a comment here.)
Alternatives
I'd seriously consider running this Rails app on a local machine, writing data to a database on the local machine, especially if you intend to target 200 web servers. This would eliminate all Heroku row limits, and you don't really need to run it 24 hours a day if this is just a hobby. If you're doing this professionally, your income from it should easily cover the cost of a hobby-basic plan on Heroku. (Currently $9.00/month.) But even then I'd think hard about hosting this locally.
I am building a site that allows users to view and do some activities (vote, comments,...) on articles. I am using MySql as main storage. In order to improve performance, I am considering using Redis (4.x) to handle some view activities such as top/hot articles...
I am gonna use one sortedSet, called topAticleSortedSet, to store top articles, and this set will be updated frequently every time a user vote or somment on a certain article.
Since each user will login and follow some topics and I also need to filter and display articles in the topArticleSortedSet based on users' following topics.
There is of course scroll paging as well.
For those reasons, I intend to create one topArticleSortedSet for each user and that way each user will have one independent list. But I dont know if this is best practice because there might be million of logged-in users access in my site (then it would be million of sets which is around 1000 article items for each).
Can anyone give me some advice please?
I think you should keep to one Set, and filter it for each user, instead of having a Set per user. Here is why:
My understanding is that the Set have to be updated each time someone reads an article (incrementing a counter probably).
Let's say you have n users, each one reading p articles per day. So you have to update the Set n*p times a day.
In the "single" set option, you will need to update just one set when there is an article read. So it makes a total of n*p updates. In the "one set per user" architecture, you will need to do n*p*n updates, which is much bigger.
Of course, filtering a single Set will take you some time, longer than accessing a Set designed for one user. But on average, I guess it would take you much less time than n operations. Basically, you need to know which is faster: filtering one Set or updating n Sets ?
There are many accounts, which get events (data points with timestamps) stored in realtime. I discovered that it is a good idea to store events using a sorted set. I tried to store events for multiple accounts in a one sorted set, but then didn't figure out how to filter events by account id.
Is it a good idea to create multiple sorted sets for each account (> 1000 accounts)?
Questions:
How long will you keep these events in memory ?
Your number of accounts won't grow ?
Are you sure you will have enough memory ?
... but yes, you should definitely create a sorted set for each account, that's the state of art when using Redis.
However, if it's all about real-time events (storing and retrieval) you may want to give a try to a database like InfluxDB that provides a powerful SQL-like query system. It seems a better answer to your problem.
On my website, there exists a group of 'power users' who are fantastic and adding lots of content on to my site.
However, their prolific activities has led to their profile pages slowing down a lot. For 95% of the other users, the SPROC that is returning the data is very quick. It's only for these group of power users, the very same SPROC is slow.
How does one go about optimising the query for this group of users?
You can assume that the right indexes have already been constructed.
EDIT: Ok, I think I have been a bit too vague. To rephrase the question, how can I optimise my site to enhance the performance for these 5% of users. Given that this SPROC is the same one that is in use for every user and that it is already well optimised, I am guessing the next steps are to explore caching possibilities on the data and application layers?
EDIT2: The only difference between my power users and the rest of the users is the amount of stuff they have added. So I guess the bottleneck is just the sheer number of records that is being fetched. An average user adds about 200 items to my site. These power users add over 10,000 items. On their profile, I am showing all the items they have added (you can scroll through them).
I think you summed it up here:
An average user adds about 200 items
to my site. These power users add over
10,000 items. On their profile, I am
showing all the items they have added
(you can scroll through them).
Implement paging so that it only fetches 100 at a time or something?
Well you can't optimize a query for a specific result set and leave the query for the rest unchanged. If you know what I mean. I'm guessing there's only one query to change, so you will optimize it for every type of user. Therefore this optimization scenario is no different from any other. Figure out what the problem is; is it too much data being returned? Calculations taking too long because of the amount of data? Where exactly is the cause of the slowdown? Those are questions you need to ask yourself.
However I see you talking about profile pages being slow. When you think the query that returns that information is already optimized (because it works for 95%), you might consider some form of caching of the profile page content. In general, profile pages do not have to supply real-time information.
Caching can be done in a lot of ways, far too many to cover in this answer. But to give you one small example; you could work with a temp table. Your 'profile query' returns information from that temp table, information that is already calculated. Because that query will be simple, it won't take that much time to execute. Meanwhile, you make sure that the temp table periodically gets refreshed.
Just a couple of ideas. I hope they're useful to you.
Edit:
An average user adds about 200 items to my site. These power users add over 10,000 items.
On their profile, I am showing all the
items they have added (you can scroll
through them).
An obvious help for this will be to limit the number of results inside the query, or apply a form of pagination (in the DAL, not UI/BLL!).
You could limit the profile display so that it only shows the most recent 200 items. If your power users want to see more, they can click a button and get the rest of their items. At that point, they would expect a slower response.
Partition / separate the data for those users then the tables in question will be used by only them.
In a clustered environment I believe SQL recognises this and spreads the load to compensate, however in a single server environment i'm not entirely sure how it does the optimisation.
So essentially (greatly simplified of course) ...
If you havea table called "Articles", have 2 tables ... "Articles", "Top5PercentArticles".
Because the data is now separated out in to 2 smaller subsets of data the indexes are smaller and the read and write requests on a single table in the database will drop.
it's not ideal from a business layer point of view as you would then need some way to list what data is stored in what tables but that's a completely separate problem altogether.
Failing that your only option past execution plans is to scale up your server platform.