Implementing Wilson Score in SQL - sql

We have a relatively small table that we would like to sort based on rating, using the Wilson interval or a reasonable equivalent. I'm a reasonably smart guy, but my math fu is nowhere near strong enough to understand this:
The above formula, I am told, calculates a score for a positive/negative (thumbs up/thumbs down) voting system. I've never taken a statistics course, and it's been 15 years since I've done any sort of advanced mathematics. I don't have a clue what the little hat that the p is wearing means, or what the backwards Jesus fish beneath z indicates.
I would like to know two things:
Can this formula be altered to accommodate a 5-star rating system? I found this, but the author expresses his doubts as to the accuracy of his formula.
How can this formula be expressed in a SQL function? Note that I do not need to calculate and sort in real-time. The score can be calculated and cached daily.
Am I overlooking something built-in to Microsoft SQL Server?

Instead of trying to manipulate the Wilson's algorithm to do a 5 star rating system. Why don't you look into a different algorithm? This is what imdb uses for their top 250: Bayesian Estimate
As for explaining the math in the Wilson's algorithm, below was posted on the link in your first post. It is written in Ruby.
require 'statistics2'
def ci_lower_bound(pos, n, power)
if n == 0
return 0
end
z = Statistics2.pnormaldist(1-power/2)
phat = 1.0*pos/n
(phat + z*z/(2*n) - z * Math.sqrt((phat*(1-phat)+z*z/(4*n))/n))/(1+z*z/n)
end
If you'd like another example, here is one in PHP:
http://www.derivante.com/2009/09/01/php-content-rating-confidence/
Edit: It seems that derivante.com is no longer around. You can see the original article on archive.org - https://web.archive.org/web/20121018032822/http://derivante.com/2009/09/01/php-content-rating-confidence/ and I've added the code from the article below.
class Rating
{
public static function ratingAverage($positive, $total, $power = '0.05')
{
if ($total == 0)
return 0;
$z = Rating::pnormaldist(1-$power/2,0,1);
$p = 1.0 * $positive / $total;
$s = ($p + $z*$z/(2*$total) - $z * sqrt(($p*(1-$p)+$z*$z/(4*$total))/$total))/(1+$z*$z/$total);
return $s;
}
public static function pnormaldist($qn)
{
$b = array(
1.570796288, 0.03706987906, -0.8364353589e-3,
-0.2250947176e-3, 0.6841218299e-5, 0.5824238515e-5,
-0.104527497e-5, 0.8360937017e-7, -0.3231081277e-8,
0.3657763036e-10, 0.6936233982e-12);
if ($qn < 0.0 || 1.0 < $qn)
return 0.0;
if ($qn == 0.5)
return 0.0;
$w1 = $qn;
if ($qn > 0.5)
$w1 = 1.0 - $w1;
$w3 = - log(4.0 * $w1 * (1.0 - $w1));
$w1 = $b[0];
for ($i = 1;$i <= 10; $i++)
$w1 += $b[$i] * pow($w3,$i);
if ($qn > 0.5)
return sqrt($w1 * $w3);
return - sqrt($w1 * $w3);
}
}
As for doing this in SQL, SQL has all these Math functions already in it's library. If I were you I'd do this in your application though. Make your application update your database every so often (hours? days?) instead of doing this on the fly or your application will become very slow.

Regarding your first question (adjusting the formula to the 5-stars system) I would agree with Paul Creasey.
conversion formula: [3 +/- i stars -> i up/down-votes] (3 stars -> 0)
example: 4 stars -> +1 up-vote, 5 stars -> +2, 1 -> -2 and so on.
I would note though that instead of the lower bound of the interval that both ruby and php functions compute, I would just compute the much more simple wilson midpoint:
(x + (z^2)/2) / (n + z^2)
where:
n = Sum(up_votes) + Sum(|down_votes|)
x = (positive votes)/n = Sum(up_votes) / n
z = 1.96 (fixed value)

Taking Williams link to the php solution http://www.derivante.com/2009/09/01/php-content-rating-confidence/ and making your system such that it just postive and negative (5 stars could be 2 pos, 1 start could be 2 neg perhaps) then it would be fairly easy to convert it to T-SQL, but you'd be much better off doing it in the server side logic.

The author of the first link recently added an SQL implementation to his post.
Here it is:
SELECT widget_id, ((positive + 1.9208) / (positive + negative) -
1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604) /
(positive + negative)) / (1 + 3.8416 / (positive + negative))
AS ci_lower_bound FROM widgets WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;
Whether this can be accommodated to a 5-star rating system is beyond me too.

I have uploaded an Oracle PL/SQL implementation to https://github.com/mattgrogan/stats_wilson_score
create or replace function stats_wilson_score(
/*****************************************************************************************************************
Author : Matthew Grogan
Website : https://github.com/mattgrogan
Name : stats_wilson_score.sql
Description : Oracle PL/SQL function to return the Wilson Score Interval for the given proportion.
Citation : Wilson E.B. J Am Stat Assoc 1927, 22, 209-212
Example:
select
round(29 / 250, 4) point_estimate,
stats_wilson_score(29, 250, 0.10, 'LCL') lcl,
stats_wilson_score(29, 250, 0.10, 'UCL') ucl
from dual;
******************************************************************************************************************/
x integer, -- Number of successes
m integer, -- Number of trials
alpha number default 0.95, -- Probability of a Type I error
return_value varchar2 default 'LCL' -- LCL = Lower control limit, UCL = upper control limit
)
return number is
z float(10);
phat float(10) := 0.0;
lcl float(10) := 0.0;
ucl float(10) := 0.0;
begin
if m = 0 then
return(0);
end if;
case alpha
when 0.10 then z := 1.644854;
when 0.05 then z := 1.959964;
when 0.01 then z := 2.575829;
else return(null); -- No Z value for this alpha
end case;
phat := x/m;
lcl := (phat + z*z/(2*m) - z * sqrt( (phat * (1-phat) ) / m + z * z / (4 * (m * m)) ) ) / (1 + z * z / m);
ucl := (phat + z*z/(2*m) + z * sqrt((phat*(1-phat)+z*z/(4*m))/m))/(1+z*z/m);
case return_value
when 'LCL' then return(lcl);
when 'UCL' then return(ucl);
else return(null);
end case;
end;
/
grant execute on stats_wilson_score to public;

The Wilson score is actually not a very good of a way of sorting items by rating. It's certainly better than just sorting by mean review score, but it still has a lot of problems. For example, an item with 1 negative review (whose quality is still very uncertain) will be sorted below an item with 10 negative reviews and 1 positive review (which we can be fairly certain is bad quality).
I would recommend using an adaptation of the SteamDB rating formula instead (by Reddit user /u/tornmandate). In addition to being better suited to this sort of thing than the Wilson score (for reasons that are explained in the linked article), it can also be adapted to a 5-star rating system much more easily than Wilson.
Original SteamDB formula:
( Total Reviews = Positive Reviews + Negative Reviews )
( Review Score = frac{Positive Reviews}{Total Reviews} )
( Rating = Review Score - (Review Score - 0.5)*2^{-log_{10}(Total Reviews + 1)} )
5-star version (note the change from 0.5 (a 50% score with up/down votes) to 2.5 (a 50% score with 5-star ratings)):
( Total Reviews = total count of all reviews )
( Review Score = mean star rating of all reviews )
( Rating = Review Score - (Review Score - 2.5)*2^{-log_{10}(Total Reviews + 1)} )
The formula is also much more understandable by non-mathematicians and easy to translate into code.

Related

Query for "surrounding hex tiles" based on axial coordinates

After reading Red Blob Games' excellent article on heaxgon tile maps and their coordinates.
I am wondering how one would write a SQL-query that returns the tiles surrounding a centered tile up to a range of X. (assuming the "axial coordinates" covered in the article)
A simple idea I first had was
WHERE x BETWEEN tile_x - 1 AND tile_x + 1 AND y BETWEEN tile_y - 1 AND tile_y + 1
But this will return too many tiles, in a way that creates a shape more like a rhombus rather than a circle, which is what I need.
Unfortunately, I haven't found a conclusive answer to this, maybe someone here can give me a hint.
I already thought about some tricks on the sum of the coordinates and wether they are larger of lower than the range, but this won't work with the axial coordinates.
From the diagrams in the linked article, it seems that something like
where (x between tile_x and tile_x + 1) and (y between tile_y - 1 and tile_y + 1)
or (x = tile_x - 1) and (y = tile_y)
should work
If you want to find the tiles (tile_x, tile_y) within a distance n from a given tile (x, y), it will be easier if the x coordinate is modified by adding 0.5 to the x coordinate of each row having an odd distance from the given tile, such that the symmetry is increased:
-1.5 -0.5 0.5 1.5
-2 -1 0 1 2
-2.5 -1.5 -0.5 0.5 1.5 2.5
-3 -2 -1 0 1 2 3
-2.5 -1.5 -0.5 0.5 1.5 2.5
-2 -1 0 1 2
-1.5 -0.5 0.5 1.5
This can be achieved using the expression tile_x + 0.5 * tile_y%2
As the number of tiles within the given distance is reduced by one
from row to row, the limits of (modified) x coordinates in a given
row is n - abs(tile_y - y)/2.
Then the tiles is within a distance n if
abs(tile_y - y) <= n
and abs(tile_x - x + 0.5 * (tile_y-y)%2) <= n - abs(tile_y - y)/2
In sql:
SELECT tile_x, tile_y
FROM tiles
WHERE ABS(tile_y - y) <= n
AND ABS(tile_x - x +0.5*(tile_y-y)%2) + ABS(tile_y - y) / 2 <= n
After some more reading, I found that the article I linked actually alread solves that problem, although somewhat hidden, so here is a more "straight forward" answer.
(Nonetheless, Terje D.'s answer works fine, I just post this because I feel it's a little easier to understand than the "sorcery" of his answer)
In the article from Red Blob Games, he actually describes the formula that calculates the distance between two given hexes:
function hex_distance(Hex(q1, r1), Hex(q2, r2)) {
return (abs(q1 - q2) + abs(r1 - r2)
+ abs(q1 + r1 - q2 - r2)) / 2;
}
I translated this into a function in MySQL
CREATE FUNCTION `rangeBetweenTiles`(tile1q int, tile1r int, tile2q int, tile2r int) RETURNS int(11)
DETERMINISTIC
BEGIN
RETURN (abs(tile1q - tile2q) + abs(tile1r - tile2r)
+ abs(tile1q + tile1r - tile2q - tile2r)) / 2;
END
And use this in another function:
CREATE FUNCTION `isInRange`(tile1q int, tile1r int, tile2q int, tile2r int, `range` Int) RETURNS tinyint(1)
DETERMINISTIC
BEGIN
RETURN rangeBetweenTiles(tile1q, tile1r, tile2q, tile2r) <= `range`;
END
Which can then be used very easily in a select statement:
select *
from tiles
where isInRange(:tile_q, :tile_r, positionQ, positionR, :n)
And it works perfectly fine for any :n

Smooth Coloring Mandelbrot Set Without Complex Number Library

I've coded a basic Mandelbrot explorer in C#, but I have those horrible bands of color, and it's all greyscale.
I have the equation for smooth coloring:
mu = N + 1 - log (log |Z(N)|) / log 2
Where N is the escape count, and |Z(N)| is the modulus of the complex number after the value has escaped, it's this value which I'm unsure of.
My code is based off the pseudo code given on the wikipedia page: http://en.wikipedia.org/wiki/Mandelbrot_set#For_programmers
The complex number is represented by the real values x and y, using this method, how would I calculate the value of |Z(N)| ?
|Z(N)| means the distance to the origin, so you can calculate it via sqrt(x*x + y*y).
If you run into an error with the logarithm: Check the iterations before. If it's part of the Mandelbrot set (iteration = max_iteration), the first logarithm will result 0 and the second will raise an error.
So just add this snippet instead of your old return code. .
if (i < iterations)
{
return i + 1 - Math.Log(Math.Log(Math.Sqrt(x * x + y * y))) / Math.Log(2);
}
return i;
Later, you should divide i by the max_iterations and multiply it with 255. This will give you a nice rgb-value.

Calculate APR (annual percentage rate) Programmatically

I am trying to find a way to programmatically calculate APR based on
Total Loan Amount
Payment Amount
Number of payments
Repayment frequency
There is no need to take any fees into account.
It's ok to assume a fixed interest rate, and any remaining amounts can be rolled into the last payment.
The formula below is based on a credit agreement for a total amount of credit of €6000 repayable in 24 equal monthly instalments of €274.11.
(The APR for the above example is 9.4%)
I am looking for an algorithm in any programming language that I can adapt to C.
I suppose you want to compute X from your equation. This equation can be written as
f(y) = y + y**2 + y**3 + ... + y**N - L/P = 0
where
X = APR
L = Loan (6000)
P = Individual Payment (274.11)
N = Number of payments (24)
F = Frequency (12 per year)
y = 1 / ((1 + X)**(1/F)) (substitution to simplify the equation)
Now, you need to solve the equation f(y) = 0 to get y. This can be done e.g. using the Newton's iteration (pseudo-code):
y = 1 (some plausible initial value)
repeat
dy = - f(y) / f'(y)
y += dy
until abs(dy) < eps
The derivative is:
f'(y) = 1 + 2*y + 3*y**2 + ... + N*y**(N-1)
You would compute f(y) and f'(y) using the Horner rule for polynomials to avoid the exponentiation. The derivative can be likely approximated by some few first terms. After you find y, you get x:
x = y**(-F) - 1
Here is the Objective C code snippet I came up with (which seems to be correct) if anybody is interested:
float x = 1;
do{
fx = initialPaymentAmt+paymentAmt *(pow(x, numPayments+1)-x)/(x-1)+0*pow(x,numPayments)-totalLoanAmt;
dx = paymentAmt *(numPayments * pow( x , numPayments + 1 ) - ( numPayments + 1 )* pow(x,numPayments)+1) / pow(x-1,2)+numPayments * 0 * pow(x,numPayments-1);
z = fx / dx;
x=x-z;
} while (fabs(z)>1e-9 );
apr=100*(pow(1/x,ppa)-1);

How would I do this in a program? Math question

I'm trying to make a generic equation which converts a value. Here are some examples.
9,873,912 -> 9,900,000
125,930 -> 126,000
2,345 -> 2,400
280 -> 300
28 -> 30
In general, x -> n
Basically, I'm making a graph and I want to make values look nicer. If it's a 6 digit number or higher, there should be at least 3 zeros. If it's a 4 digit number or less, there should be at least 2 digit numbers, except if it's a 2 digit number, 1 zero is fine.
(Ignore the commas. They are just there to help read the examples). Anyways, I want to convert a value x to this new value n. What is an equation g(x) which spits out n?
It is for an objective-c program (iPhone app).
Divide, truncate and multiply.
10**x * int(n / 10**(x-d))
What is "x"? In your examples it's about int(log10(n))-1.
What is "d"? That's the number of significant digits. 2 or 3.
Ahhh rounding is a bit awkward in programming in general. What I would suggest is dividing by the power of ten, int cast and multiplying back. Not remarkably efficient but it will work. There may be a library that can do this in Objective-C but that I do not know.
if ( x is > 99999 ) {
x = ((int)x / 1000) * 1000;
}
else if ( x > 999 ) {
x = ((int) x / 100) * 100;
}
else if ( x > 9 ) {
x = ((int) x / 10) * 10;
}
Use standard C functions like round() or roundf()... try man round at a command line, there are several different options depending on the data type. You'll probably want to scale the values first by dividing by an appropriate number and then multiplying the result by the same number, something like:
int roundedValue = round(someNumber/scalingFactor) * scalingFactor;

Fast formula for a "high contrast" curve

My inner loop contains a calculation that profiling shows to be problematic.
The idea is to take a greyscale pixel x (0 <= x <= 1), and "increase its contrast". My requirements are fairly loose, just the following:
for x < .5, 0 <= f(x) < x
for x > .5, x < f(x) <= 1
f(0) = 0
f(x) = 1 - f(1 - x), i.e. it should be "symmetric"
Preferably, the function should be smooth.
So the graph must look something like this:
.
I have two implementations (their results differ but both are conformant):
float cosContrastize(float i) {
return .5 - cos(x * pi) / 2;
}
float mulContrastize(float i) {
if (i < .5) return i * i * 2;
i = 1 - i;
return 1 - i * i * 2;
}
So I request either a microoptimization for one of these implementations, or an original, faster formula of your own.
Maybe one of you can even twiddle the bits ;)
Consider the following sigmoid-shaped functions (properly translated to the desired range):
error function
normal CDF
tanh
logit
I generated the above figure using MATLAB. If interested here's the code:
x = -3:.01:3;
plot( x, 2*(x>=0)-1, ...
x, erf(x), ...
x, tanh(x), ...
x, 2*normcdf(x)-1, ...
x, 2*(1 ./ (1 + exp(-x)))-1, ...
x, 2*((x-min(x))./range(x))-1 )
legend({'hard' 'erf' 'tanh' 'normcdf' 'logit' 'linear'})
Trivially you could simply threshold, but I imagine this is too dumb:
return i < 0.5 ? 0.0 : 1.0;
Since you mention 'increasing contrast' I assume the input values are luminance values. If so, and they are discrete (perhaps it's an 8-bit value), you could use a lookup table to do this quite quickly.
Your 'mulContrastize' looks reasonably quick. One optimization would be to use integer math. Let's say, again, your input values could actually be passed as an 8-bit unsigned value in [0..255]. (Again, possibly a fine assumption?) You could do something roughly like...
int mulContrastize(int i) {
if (i < 128) return (i * i) >> 7;
// The shift is really: * 2 / 256
i = 255 - i;
return 255 - ((i * i) >> 7);
A piecewise interpolation can be fast and flexible. It requires only a few decisions followed by a multiplication and addition, and can approximate any curve. It also avoids the courseness that can be introduced by lookup tables (or the additional cost in two lookups followed by an interpolation to smooth this out), though the lut might work perfectly fine for your case.
With just a few segments, you can get a pretty good match. Here there will be courseness in the color gradients, which will be much harder to detect than courseness in the absolute colors.
As Eamon Nerbonne points out in the comments, segmentation can be optimized by "choos[ing] your segmentation points based on something like the second derivative to maximize detail", that is, where the slope is changing the most. Clearly, in my posted example, having three segments in the middle of the five segment case doesn't add much more detail.