Elo ratings: how many votes must be cast to get a reliable outcome?

I want to set up a system that will end up ranking competitors against one another based on the votes. In this example, there will be 250 competitors, but only 4 people able to cast votes. We ideally want it set up in a hot-or-not fashion (using the Elo rating system), but I wonder how many votes must be cast before we'd get a fair ranking?
Does anyone have any thoughts on how I might establish a fair(ish) rating without each voter casting thousands of votes?

It depends on your k-factor, i.e. how quickly you want ratings to react to changes in skill.
If you use a higher k-factor, the rankings will quickly approximate the skill of the competitors. However, in that case the ranking will be mostly a short-term value, with chance, pairings and "bad days" affecting it greatly.
Using a multiple level k-factor system, like the chess world does, lets you both quickly converge to approximate ratings for new players (and the initial set of players) and track a longer term ranking for established players.
I would recommend starting with the values FIDE uses, so you don't have to retest extensively:
400 as a denominator in the exponents, so that 200 points' difference = 75% winning chance
k = 30 for the first 30 games
k = 20 after that until the player has reached 2400 ranking at least once
k = 10 thereafter
If 30 games is too much for the initial period, you could use a lower number but increase the initial k proportionally. Beware that this will make the initial ranking very variable.
If you want a different normalization than the 200 points -> 75%, you can divide all the numbers above by the same constant.
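For concreteness, here is a minimal Python sketch of the update rule described above, using the 400-point denominator and the tiered K values from this answer. The function names and the sample ratings are only illustrative.

    def k_factor(games_played, peak_rating):
        # K = 30 for the first 30 games, 20 until the player has ever reached 2400, 10 after that.
        if games_played < 30:
            return 30
        return 10 if peak_rating >= 2400 else 20

    def expected_score(rating_a, rating_b):
        # Standard Elo expectation with a 400-point denominator.
        return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

    def update(rating_a, rating_b, score_a, k):
        # score_a is 1 for a win, 0.5 for a draw, 0 for a loss, from A's point of view.
        return rating_a + k * (score_a - expected_score(rating_a, rating_b))

    # Example: a 1500-rated player with 10 games played beats a 1700-rated player.
    print(update(1500, 1700, 1.0, k_factor(10, 1500)))  # about 1522.8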

Related

Is there an algorithm for near optimal partition of the Travelling salesman problem, creating routes that need the same time to complete?

I have a problem to solve. I need to visit 7962 places with a vehicle. The vehicle travels at 10 km/h, and each time I visit a place I stay there for 1 minute. I want to divide those 7962 places into subsets that will each take up to 8 hours. So let's say 200 places take 8 hours: I visit them and come back the next day to visit maybe another 250 places (the 200-place subsets will require more distance travelled). For the distance I only care about Euclidean distances; there is no need to take into account the distance through the road network.
A map of the 7962 places
What I have done so far is use the k-means clustering algorithm to get good enough subsets, then the Lin-Kernighan heuristic (the Concorde program) to find the tour distance, and then compute the times. But my results range from 4 hours to 12 hours. Any idea how to make it better? Or code that does this whole task all together? Propose anything, but I am not a programmer; I just use Python sometimes.
Set of coordinates:
http://www.filedropper.com/wholesetofcoordinates
Coordinate subsets (40 clusters produced with the k-means algorithm):
http://www.filedropper.com/kmeans40clusters
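(No answer here, but a rough Python sketch of how the 8-hour check could be automated per cluster: travel time at 10 km/h plus one minute per stop, with a cheap nearest-neighbour tour standing in for the Concorde/Lin-Kernighan result. It assumes the coordinates are already projected to kilometres; the constants are placeholders to adjust.)

    import math

    SPEED_KMH = 10.0    # vehicle speed from the question
    SERVICE_MIN = 1.0   # minutes spent at each place
    LIMIT_MIN = 8 * 60  # 8-hour daily budget

    def nearest_neighbour_tour(points):
        # Cheap open-tour estimate; a stand-in for Concorde / Lin-Kernighan.
        unvisited = list(points[1:])
        tour = [points[0]]
        while unvisited:
            last = tour[-1]
            nxt = min(unvisited, key=lambda p: math.dist(last, p))
            unvisited.remove(nxt)
            tour.append(nxt)
        return tour

    def tour_minutes(points):
        # Assumes point coordinates are in kilometres (Euclidean, as in the question).
        tour = nearest_neighbour_tour(points)
        km = sum(math.dist(a, b) for a, b in zip(tour, tour[1:]))
        return km / SPEED_KMH * 60 + SERVICE_MIN * len(points)

    # A cluster whose tour_minutes() exceeds LIMIT_MIN should be split (e.g. re-run
    # k-means on just that cluster with k=2); one far under the limit could be merged.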

Redis: Maximum score size for sorted sets? Score + Unique ids = Unique Scores?

I'm using timestamps as the score. I want to prevent duplicates by appending a unique object id to the score. Currently, this id is a 6-digit number (the highest id right now is 221849), but it is expected to grow past a million. So, the score will be something like
1407971846221849 (timestamp:1407971846 id:221849) and will eventually reach 14079718461000001 (timestamp:1407971846 id:1000001).
My concern is not being able to store scores because they've reached the max allowed.
I've read the docs, but I'm a bit confused. I know, basic math. But bear with me, I want to get this right.
Redis sorted sets use a double 64-bit floating point number to represent the score. In all the architectures we support, this is represented as an IEEE 754 floating point number, that is able to represent precisely integer numbers between -(2^53) and +(2^53) included. In more practical terms, all the integers between -9007199254740992 and 9007199254740992 are perfectly representable. Larger integers, or fractions, are internally represented in exponential form, so it is possible that you get only an approximation of the decimal number, or of the very big integer, that you set as score.
There's another thing bothering me right now: would the increase in ids break the chronological sort sequence?
I will appreciate any insights, suggestions, different perspectives, or being told flat out that what I'm trying to do is nonsense.
Thanks for any help.
No, it won't break the "chronological" order, but you may lose the precision of the last digits, so two members may end up having the same score (i.e. non-unique).
There is no problem with duplicate scores. It is just maintaining a sorted set in memory. Members are unique but the scores may be the same. If you want chronological processing I would just rely on the timestamp without adding an id to it.
Appending an id would break the chronological sort if your ids are mixed: if you have timestamps 1, 2, 3 (simple example) and ids 100, 10, 1, you won't get the correct sort. If your ids will always be added monotonically, then you should just use the id as the score.
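To see the precision issue from a Python shell, and the safer pattern of keeping the timestamp as the score and the id only in the (unique) member name; the redis-py call is left commented out, and the key and member names are made up for illustration:

    # Doubles (Redis sorted-set scores) are exact only up to 2**53.
    score = 14079718461000001                 # timestamp 1407971846 with id 1000001 appended
    print(score > 2**53)                      # True: past the exact-integer range
    print(int(float(score)) == score)         # False: the last digit is already lost
    print(float(score) == float(score - 1))   # True: two different ids collapse to one score

    # Safer pattern: timestamp as the score, id only in the unique member name.
    # import redis
    # r = redis.Redis()
    # r.zadd("events", {"event:221849": 1407971846})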

Ranking algorithm in a rails app

We have a model in our Rails app whose objects are assigned a score based on positive user actions. We'll call them products for simplicity's sake. If a user likes a product or buys a product or views a product, the score is incremented at various weights (a like might be worth more than a view, two views in the span of 30 seconds might be worth more than three views spread over an hour, etc.)
We'd like to use these scores to help sort and rank products, say for a popular-products list, but using the straight score is going to unevenly favor older products, since they'll have had more time to amass a higher score.
My question is how to normalize the scores between new and old products. I thought about dividing a product's score by a unit of time, say the number of days it's been in existence, but I'm worried that will cut down the older products too much. Any thoughts on the best way to fairly normalize the scores between old and new products?
I'm also considering an example of a bayesian rating system I found in another question:
rating = ((avg_num_votes * avg_rating) + (product_num_votes * product_rating)) / (avg_num_votes + product_num_votes)
Where the avg numbers are calculated by looking at the scores across all products that have more than one vote (or, in our case, a positive action). This might not be the best way, because we don't have negative ratings in our system and it doesn't take time into consideration at all.
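For reference, the formula above as a small Python sketch (the variable names are made up, and, as noted, it ignores time entirely):

    def bayesian_rating(product_num_votes, product_rating, avg_num_votes, avg_rating):
        # Pulls products with few votes toward the site-wide average rating.
        return ((avg_num_votes * avg_rating) + (product_num_votes * product_rating)) / \
               (avg_num_votes + product_num_votes)

    # A product with only 2 positive actions barely moves away from the prior:
    print(bayesian_rating(2, 5.0, 50, 3.2))  # about 3.27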
Your question reminds me of the concept of exponential discounting of cash flows in finance.
The concept is the following: $100 in two years is worth less than $100 in one year, which is worth less than $100 now, ...
I think we can make a good comparison here: a product from yesterday is worth more than a product from the day before, but less than a product from today.
The formula is simple :
Vn = V0 * (1-t)^n
with V0 the initial value (the real number of positive votes), t a discount rate (which you choose, e.g. 10%) and n the time elapsed (for example, n days). Thus a product will lose 10% of its value each day (but 10% of the previous day's value, not of the initial value).
You can also look at hyperbolic discounting, which is closer to what you tried. The formula would be something like this, I guess:
Vn = V0 * (1/(1+k*n))
Another approach, simpler but cruder: linear discounting. You simply give the scores an initial value, like 1000, and each day you decrement all scores by 1 (or another constant).
Vn = V0 - k*n
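A small Python sketch of the three discounting schemes above; the rates and constants are placeholders you would tune.

    def exponential_discount(v0, rate, days):
        # Vn = V0 * (1 - rate)**days, e.g. rate=0.10 loses 10% of the previous day's value daily.
        return v0 * (1 - rate) ** days

    def hyperbolic_discount(v0, k, days):
        # Vn = V0 / (1 + k * days); decays more slowly for old items than the exponential form.
        return v0 / (1 + k * days)

    def linear_discount(v0, k, days):
        # Vn = V0 - k * days, floored at zero.
        return max(0.0, v0 - k * days)

    # Example: a product with 120 positive actions, 30 days old, 10% daily discount.
    print(exponential_discount(120, 0.10, 30))  # about 5.1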

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has exactly k balls is <= 1/(e * k!) (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or base 2? Is this max-load formula right? And how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10(N) = log_b(N) / log_b(10) = (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
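A quick Python sketch of the empirical test described above: hash N random inputs into N bins, record the fullest bin, average over many runs, and watch how that average grows as N increases logarithmically. The hash function and the sample sizes are placeholders.

    import random
    import statistics

    def max_load(n, hash_fn):
        # Hash n random strings into n bins and return the size of the fullest bin.
        bins = [0] * n
        for _ in range(n):
            key = str(random.random())
            bins[hash_fn(key) % n] += 1
        return max(bins)

    def average_max_load(n, runs=40, hash_fn=hash):
        return statistics.mean(max_load(n, hash_fn) for _ in range(runs))

    # Increase n logarithmically; the averages should grow roughly like ln(n)/ln(ln(n)).
    for n in (100, 1000, 10_000, 100_000):
        print(n, average_max_load(n))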
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
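As a rough check of the approximation above, here is a Python sketch that computes p/d + 1.163*sqrt(p/d) and compares it with a brute-force simulation; the 1.163 constant and the session counts come from this answer and are only approximate.

    import random

    def expected_max_door(p, d=5, top_constant=1.163):
        # Approximate expected count at the fullest of d doors: p/d + c*sqrt(p/d).
        mean = p / d
        return mean + top_constant * mean ** 0.5

    def simulated_max_door(p, d=5, sessions=200):
        # Average, over many sessions, of the fullest door when p people pick doors uniformly.
        total = 0
        for _ in range(sessions):
            doors = [0] * d
            for _ in range(p):
                doors[random.randrange(d)] += 1
            total += max(doors)
        return total / sessions

    print(expected_max_door(50_000))   # 10116.3, as in the worked example above
    print(simulated_max_door(50_000))  # should land close to that (slow in pure Python)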
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial and error, I think I can provide something partway to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = 1/(e * k!)
and accumulating the probabilities for exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
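That accumulation step as a short Python sketch, using the exactly-k formula 1/(e * k!) from the question:

    import math

    def max_load_threshold(confidence=0.999):
        # Smallest k whose cumulative Poisson(1) probability reaches the confidence level.
        total = 0.0
        k = 0
        while True:
            total += 1.0 / (math.e * math.factorial(k))  # Pr[a bin holds exactly k items]
            if total >= confidence:
                return k
            k += 1

    print(max_load_threshold(0.999))  # 5, matching the table above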
So if my hash function is working well on my set of data, then I should expect the maximum number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values. (There are ways around this; for example, you could use multiple hash functions, but that slows down hashing and I don't know much about it.)
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question are also helpful for checking that your hash function is behaving as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least-used slot from a choice of 2 locations. This more than halves the number of collisions, but means that adding to and searching the hash table is a little slower.
I hope this helps.

Elo rating system: start value when players can join the game constantly

I've implemented an Elo rating system in a game. There is no limit on the number of players. Players can join the game at any time, so the number of players will probably rise gradually.
Exactly how the Elo values are calculated isn't important because of this fact: if team A beats team B, then A's Elo gain equals B's Elo loss.
Hence I've got a problem concerning the starting values for my rating system:
Should I use the starting value "0" for every player? The sum of all Elo values would then be constant. But since the number of players is increasing, there would be some kind of Elo deflation, wouldn't there?
Should I use a starting value greater than 0? In this case, the sum of all Elo values would constantly increase, so there could be Elo inflation. The problem: Elo points would lose value, but the starting value would always stay the same.
What should I do? Can you help me? Thanks in advance!
You can start at zero and add a fudge factor to the displayed score to keep it above zero, or you can start at 1000 - they are the same thing. Yes, with the 1000 starting point you'll have an increasing number of total Elo points in the system, but it will always be the same number per player on average - 1000. The starting value for Elo is always the current average. Elo is a zero-sum game; points lost by player A are gained by player B.
When you set a starting point at 1000 what you are essentially saying is that the average player = 1000 pts. With a closed group of initial players (beta testers?) this is true, within that group average = 1000. But if the game is something you improve at with time then your closed group average player becomes highly skilled compared to someone who hasn't played.
Now when you assign 1000 to a new player, you are saying new average player = existing highly skilled average player. This is not true; they are likely to be much less skilled than your original closed group. So the new player loses points and your highly skilled players gain => inflation. What you would need to do is accurately assess the skill of new players and assign them a ranking more in keeping with their actual skill. This could be done by assigning them a "provisional ranking" for their first x games until you get a feel for their skill. While provisionally ranked, only their Elo score would change, not those of the people they play. Once they join the real system, the points they bring into the scored Elo would roughly equate to their actual skill and they wouldn't move up or down dramatically => no inflation or deflation.
In short: Provisional rankings
This site uses the Elo rating system. They start at 1200.
taken from http://gameknot.com/help-answer.pl?question=29
GameKnot rating system is based on Elo rating system with a fixed K = 20 and the following modifications:
The first 20 games are used to establish player's rating on the website. During the first 20 games, the player's rating is calculated as an average of the ratings of all his/her opponents, +400 in case of a win, -400 in case of a loss, same for a draw. +/-200 points are used when playing against a player with a provisional rating.
Player's rating is provisional during the first 20 games, after which it becomes established.
Player's rating is considered to be equal to 1200 during the first 5 rated games.
Timeouts are counted as wins only if there were at least 3 moves made in the game (losses are always counted for the timed out players, regardless of how many moves were made).
The higher of two ratings, at the beginning of the game and at the end, is used to calculate the rating adjustments after the game is over.
For example, if during your first 20 rated games you played 3 games and won against a 1200 player with a provisional rating, then against a 1400 player with an established rating, but lost against a 1600 player with an established rating, your rating will be:
( (1200 + 200) + (1400 + 400) + (1600 - 400) ) / 3 = 1467
Or, if during your first 20 rated games, you win against 1200 provisional, win against 1400 established, lose against 1600 provisional, draw against 1500, your rating will be:
( (1200 + 200) + (1400 + 400) + (1600 - 200) + 1500 ) / 4 = 1525
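The two worked examples above, expressed as a small Python sketch of the GameKnot-style provisional average (the +/-400 and +/-200 adjustments are the ones quoted in the rules; the function name is made up):

    def provisional_rating(games):
        # games: list of (opponent_rating, result, opponent_is_provisional),
        # where result is "win", "loss" or "draw".
        adjusted = []
        for opp_rating, result, opp_provisional in games:
            bonus = 200 if opp_provisional else 400
            if result == "win":
                adjusted.append(opp_rating + bonus)
            elif result == "loss":
                adjusted.append(opp_rating - bonus)
            else:  # a draw counts the opponent's rating as-is
                adjusted.append(opp_rating)
        return sum(adjusted) / len(adjusted)

    print(provisional_rating([(1200, "win", True), (1400, "win", False),
                              (1600, "loss", False)]))                        # about 1467
    print(provisional_rating([(1200, "win", True), (1400, "win", False),
                              (1600, "loss", True), (1500, "draw", False)]))  # 1525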
Elo works on the difference in the ratings of the two players or teams; the actual value is irrelevant. You can start at any number you like. I run a system for the Facebook Scrabble League with a starting value of 5000, simply to distinguish it from other Scrabble rating systems out there.
Inflation does not result from over-rated newcomers losing points to experienced players - that all evens out in time. Inflation results from people with less-than-average ratings leaving the system. This is what tends to happen in online games, unlike real-life chess, where deflation is a problem because highly rated players retire and take their points out of the system.
But do you need to worry about inflation? The only time it is important is if you wish to compare the performance of current players with historical figures - not a problem any online gamers are likely to face. Even if you do worry about inflation, it's easy to correct: find the mean rating of all your current players and compare it with the starting figure; if it's too high, reduce everyone's rating to bring it back into line. In my experience a reduction of 1 or 2 points per ratings period does the trick and accounts for a lot of newcomers who get thrashed and don't come back.
Many systems give newcomers higher K values in order that they find their level more quickly.
Another approach is not to rate newcomers until they have played their first ratings period, at which point you calculate the rating based on what it would have to have been for Elo to predict the results correctly. This is impossible if you have all unrated players together and (I think) would involve recursion if you have multiple newcomers in a tournament. It also undermines the zero-sum principle of Elo and removes your ability to measure inflation. However, talking to people who use this system, I am told "it all evens out" in practice.
I would also add that we have one determined player who has lost all but 21 out of 732 games he's played in the last four years and his Elo is still about 4100. The ratings points he loses each round are rapidly approaching zero.
I don't know if it is useful, but Mark Glickman's Ratings Page discusses some issues with Elo ratings, their declines, etc. (see the last few paragraphs there). Also see his rating system, the Glicko system, which seems to account for playing frequency and discusses rating reliability. Finally, his research page has a lot of papers discussing ratings and their reliability.
Hope that helps.
Do players see this score?
Will the players understand Elo?
Will players continue to play if their score becomes negative?
I would start everyone out at some positive point value (10, 100, 1000, it doesn't matter). When two people of relatively equal capability play each other, the scores trade as expected. Where you need to concentrate is on some measure of relative capability between two players.
Suppose, later on in the game's life, I have 25000 points and you're a n00b with 100. I beat you: I gain nothing and you lose nothing. Why? Because I just pwned a n00b, that's why. There should be no reward for taking down a starting player. Also, even when two players are within the same point range, you should implement something so that you can only earn so many points from a given player within a certain time window.
Obviously, this will be something that will be continually tweaked throughout your game's life time.
Maybe make points unable to go below the starting amount, and put the amount of the loss into a "pocket".
Let's say players start with 0.
One player has Elo 0 and loses 10 points. He will lose nothing, and those points will go into the pocket.
Now let's imagine the player wins the next game and earns 11 points. Instead of getting those 11 points, he would get just 1 point, and his "pocket" would then hold 0 points.
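A rough Python sketch of that pocket idea, using the floor of 0 and the numbers from this answer; the class name is invented for illustration.

    class PocketRating:
        # Keeps the displayed rating at or above a floor; losses below the floor
        # accumulate in a hidden pocket and are paid back out of future gains.
        def __init__(self, floor=0):
            self.rating = floor
            self.floor = floor
            self.pocket = 0

        def apply(self, delta):
            if delta < 0:
                allowed = max(delta, self.floor - self.rating)  # can't drop below the floor
                self.pocket += allowed - delta                  # the remainder goes into the pocket
                self.rating += allowed
            else:
                repay = min(delta, self.pocket)                 # gains first pay off the pocket
                self.pocket -= repay
                self.rating += delta - repay

    p = PocketRating()
    p.apply(-10)  # rating stays 0, pocket holds 10
    p.apply(+11)  # rating becomes 1, pocket is back to 0
    print(p.rating, p.pocket)  # 1 0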
I think that most ELO-like systems on the Internet will have a problem with ratings creep.
Assume that all new players start at a rating of zero. LousyPlayer loses a dozen games, and his rating goes far below zero. What stops him from clearing his browser, registering a new account under a new email address, and starting over?
If this is possible, then low-ranked accounts will lie fallow, pushing up the practical average rating.