Maths! Approximating the mean, without storing the whole data set - sql

Obvious (but expensive) solution:
I would like to store rating of a track (1-10) in a table like this:
TrackID
Vote
And then a simple
SELECT AVG(Vote) FROM `table` WHERE `TrackID` = some_val
to calculate the average.
However, I am worried about scalability on this, especially as it needs to be recalculated each time.
Proposed, but possibly stupid, solution:
TrackID
Rating
NumberOfVotes
Every time someone votes, the Rating is updated with
new_rating = ((old_rating * NumberOfVotes) + vote) / (NumberOfVotes + 1)
and stored as the TrackID's new Rating value. Now whenever the Rating is wanted, it's a simple lookup, not a calculation.
Clearly, this does not calculate the mean. I've tried a few small data sets, and it approximates the mean. I believe it might converge as the data set increases? But I'm worried that it might diverge!
What do you guys think? Thanks!

Assuming you had infinite numeric precision, that calculation does update the mean correctly. In practice, you're probably using integer types, so it won't be exact.
How about storing the cumulative vote total, and the number of votes? (i.e. total=total+vote, numVotes=numVotes+1). That way, you can get the exact mean by dividing one by the other.
This approach will only break if you get so many votes that you overflow the range of the data type you're using. So use big data types: with ratings up to 10, a signed 32-bit sum already overflows after roughly 200 million votes, so a 64-bit integer is the safer choice!
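To make the difference concrete, here is a small illustrative Java sketch (not from the original posts) comparing the incremental update with integer columns against the exact sum-and-count approach; the sample votes are made up:

// Illustrative sketch only: the OP's incremental formula with integer truncation
// at every step, versus the exact mean computed from a running sum and count.
public class RunningMeanDemo {
    public static void main(String[] args) {
        int[] votes = {7, 9, 3, 10, 1, 8, 8, 2}; // made-up ratings

        int incrementalRating = 0; // integer "Rating" column, updated per vote
        long ratingSum = 0;        // exact cumulative sum
        int numberOfVotes = 0;

        for (int vote : votes) {
            // OP's formula, truncated to an integer on every update
            incrementalRating = ((incrementalRating * numberOfVotes) + vote) / (numberOfVotes + 1);
            ratingSum += vote;
            numberOfVotes++;
        }

        System.out.println("Incremental (int) mean: " + incrementalRating);        // drifts due to truncation
        System.out.println("Exact mean: " + ((double) ratingSum / numberOfVotes)); // always exact
    }
}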

Store TrackId, RatingsSum, NumberOfVotes in your table.
Every time someone votes,
NumberOfVotes = NumberOfVotes + 1
RatingsSum = RatingsSum + [rating supplied by user]
Then when selecting
SELECT TrackId, RatingsSum / NumberOfVotes FROM ...

Your solution is completely legit, and differs from a value calculated over the full source set only by roughly a few multiples of the floating-point precision.

You can certainly calculate a running mean and standard deviation without having all the points in hand. You merely need to accumulate the sum, sum of squares, and number of points.
It's not an approximation; the mean and standard deviation are exact.
Here's a Java class that demonstrates this; you can adapt it to your SQL solution as needed:
package statistics;

public class StatsUtils
{
    private double sum;
    private double sumOfSquares;
    private long numPoints;

    public StatsUtils()
    {
        this.init();
    }

    private void init()
    {
        this.sum = 0.0;
        this.sumOfSquares = 0.0;
        this.numPoints = 0L;
    }

    public void addValue(double value)
    {
        // Check for overflow in either number of points or sum of squares; reset if overflow is detected
        if ((this.numPoints == Long.MAX_VALUE) || (this.sumOfSquares > (Double.MAX_VALUE - value*value)))
        {
            this.init();
        }
        this.sum += value;
        this.sumOfSquares += value*value;
        ++this.numPoints;
    }

    public double getMean()
    {
        double mean = 0.0;
        if (this.numPoints > 0)
        {
            mean = this.sum/this.numPoints;
        }
        return mean;
    }

    public double getStandardDeviation()
    {
        double standardDeviation = 0.0;
        if (this.numPoints > 1)
        {
            standardDeviation = Math.sqrt((this.sumOfSquares - this.sum*this.sum/this.numPoints)/(this.numPoints - 1L));
        }
        return standardDeviation;
    }

    public long getNumPoints() { return this.numPoints; }
}
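For completeness, a brief usage sketch of the class above (not part of the original answer; the sample values are made up, expected output shown in comments):

// Illustrative usage of StatsUtils; four made-up ratings.
package statistics;

public class StatsUtilsDemo
{
    public static void main(String[] args)
    {
        StatsUtils stats = new StatsUtils();
        for (double rating : new double[] {7, 9, 3, 10})
        {
            stats.addValue(rating);
        }
        System.out.println(stats.getMean());              // 7.25
        System.out.println(stats.getStandardDeviation()); // ~3.10 (sample standard deviation)
        System.out.println(stats.getNumPoints());         // 4
    }
}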

Small improvement on your solution. You have the table:
TrackID
SumOfVotes
NumberOfVotes
When someone votes,
NumberOfVotes = NumberOfVotes + 1
SumOfVotes = SumOfVotes + ThisVote
and to see the average you only then do a division:
SELECT TrackID, (SumOfVotes/NumberOfVotes) AS Rating FROM `table`
I would add that the original (obvious but expensive) solution is only expensive compared to the provided solution when calculating the average.
It is cheaper when a vote is added, deleted or changed.
I guess that the original table
TrackID
Vote
VoterID
would still need to be used in the provided solution to keep track of the vote (rating) of every voter. So, two tables have to be updated for every change in this table (insert, delete or vote update).
In other words, the original solution may be the best way to go.

Related

Constraint to require a certain number of matches

I am trying to write a hard constraint that requires that a certain value has been chosen a certain number of times. I have a constraint written below, which (I think) filters to a set of results that match this criterion, and I want it to penalize if there are no such results. I cannot figure out how to work .ifNotExists() into this. I think I am missing some understanding.
fun cpMustUseN(constraintFactory: ConstraintFactory): Constraint {
    return constraintFactory.forEach(MealMenu::class.java)
        .join(CpMustUse::class.java, equal({ mm -> mm.slottedCp!!.id }, CpMustUse::cpId))
        .groupBy({ _, cpMustUse -> cpMustUse.numRequired }, countBi())
        .filter { numRequired, count -> count >= numRequired }
        .penalize(HardSoftScore.ONE_HARD)
        .asConstraint("cpMustUseN")
}
MealMenu is an entity:
@PlanningEntity
class MealMenu {
    @PlanningId
    var id = 0

    @PlanningVariable(valueRangeProviderRefs = ["cpRange"])
    var slottedCp: Cp? = null
}
CpMustUse is a @ProblemFactCollectionProperty on my solution class, and the class looks like this:
class CpMustUse {
    var cpId = 1
    var numRequired = 4
}
I want to, in this case, constrain the result such that cpId 1 is chosen at least 4 times.
There are two conceptual issues here:
groupBy() only matches if the join returns a non-zero number of matches, so you will never get a countBi() of zero - in that case, groupBy() simply never matches. Therefore you cannot use grouping to check that something does not exist.
ifNotExists() always applies to a fact from the working memory. You can not use it to check if a result of a previous calculation exists.
Taken together, this makes your approach infeasible. This particular requirement will be a bit trickier to implement.
Start by inverting the logic of the constraint you pasted: penalize every time count < numRequired; this handles all cases where count >= 1 (a sketch follows below).
Then introduce a second constraint that will handle specifically the case where the count would be zero - in this case, you should be able to use forEach(MealMenu::class.java).ifNotExists(CpMustUse::class, ...).
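As a rough illustration of that first, inverted constraint (written in Java rather than the question's Kotlin; the accessors getSlottedCp(), getId(), getCpId() and getNumRequired() are assumed, imports are omitted, and the exact constraint-stream API may vary by OptaPlanner version):

// Hedged sketch of step 1: penalize when the CP is used, but fewer times than required.
// Mirrors the question's constraint with the filter inverted; getters are assumed.
Constraint cpMustUseN(ConstraintFactory constraintFactory) {
    return constraintFactory.forEach(MealMenu.class)
            .join(CpMustUse.class,
                    Joiners.equal((MealMenu mm) -> mm.getSlottedCp().getId(), CpMustUse::getCpId))
            .groupBy((mm, cpMustUse) -> cpMustUse.getNumRequired(),
                    ConstraintCollectors.countBi())
            .filter((numRequired, count) -> count < numRequired)
            .penalize(HardSoftScore.ONE_HARD)
            .asConstraint("cpMustUseN");
}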

ScrollableResults size gives repeated value

I am working on an application using Hibernate and Spring. I am trying to get the count of results returned by a query using ScrollableResults, but as the query contains lots of joins (inner joins), the result contains the same id repeated many times. This creates a problem when I use ScrollableResults to find the total number of unique rows (or unique ids) returned from the database. Please help. Some of the code is below:
StringBuffer queryBuf = new StringBuffer("Some SQL query with lots of Joins");
Query query = getSession().createSQLQuery(queryBuf.toString());
query.setReadOnly(true);
ScrollableResults results = query.scroll();
if (results.isLast() == false)
results.last();
int total = results.getRowNumber() + 1;
logger.debug(">>>>>>TOTAL COUNT<<<<<< = {}", total);
It gives a total count of 1440, but the actual number of unique rows in the database is 504.
Thanks in Advance.
You can try
Integer count= ((Long)query.uniqueResult()).intValue();
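That snippet assumes you run a separate count query instead of scrolling the whole result set. A hedged sketch of what that could look like here (the SQL text is only a placeholder - reuse your own joins, but count DISTINCT ids so the duplicates produced by the joins collapse):

// Illustrative sketch: a separate native count query instead of scrolling.
// The SQL is a placeholder; keep the original joins and count distinct ids.
Query countQuery = getSession().createSQLQuery(
        "SELECT COUNT(DISTINCT t.id) FROM ... /* same joins as the original query */");
int total = ((Number) countQuery.uniqueResult()).intValue();
logger.debug(">>>>>>TOTAL COUNT<<<<<< = {}", total);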
Unfortunately, getRowNumber does not give you the size, or the number of results, but the current position in the results. ScrollableResults does not provide a way to get the number of results out-of-the-box.
I am referring to ScrollableResults Hibernate Version 5.4.
As a workaround, you can try
Long l_resultsCount = 0L;
while(results.next()) {
l_resultsCount++;
}
getRowNumber() gives the number of the current row.
Call last() and afterwards getRowNumber()+1 will give the total number of results.

SQL "group by" like - grouping algorithm

I have a table with more than 2 columns (let's say A, B and C). One column holds some numbers (C) and I want to do a "group by" like grouping, summing the numbers in C, but I don't know the algorithm for doing so.
I tried sorting the table by each column (from last to first, aside from the numbers column (C), so in this case: sort(B) and then sort(A)) and then, wherever nth row holds same values in A and B as in n-1th row, I add the number from nth row to n-1th row (in the C column), and then delete the nth row. Else, if A or B value in row n differs from A or B value in n-1th row, I'll just move to the next row. Then I repeat the algorithm till the last row in table. But somehow this isn't working all the time, especially when there're a lot more columns (some rows remain ungrouped, maybe because of the sorting method).
I want to know whether this is a good grouping algorithm and I need to look for the problem into the sorting method, or I need to use another (sorting and/or grouping) algorithm and which one. Thank you.
LE: Apparently the algorithm that I used works well after a thorough check of the code and fixing some minor mistakes that junior programmers like me often make :)
I think a good way to do this would be to wrap your row into a class, implement the equals and hashCode methods (the row will be used as a HashMap key), and then use a Map to add the values up:
public class MyRow {
    private Long columnA;
    private String columnB;
    private int columnC;

    public int getColumnC() {
        return this.columnC;
    }

    @Override
    public boolean equals(final Object other) {
        if (!(other instanceof MyRow)) {
            return false;
        }
        final MyRow otherRow = (MyRow) other;
        return this.columnA.equals(otherRow.columnA) && this.columnB.equals(otherRow.columnB);
    }

    @Override
    public int hashCode() {
        return 31 * this.columnA.hashCode() + this.columnB.hashCode();
    }
}
Then you can iterate over all the rows, and create a Map for holding the sums of C.
final Map<MyRow, Integer> computedCSums = new HashMap<MyRow, Integer>();
for (final MyRow myRow : myRows) {
    if (computedCSums.get(myRow) == null) {
        computedCSums.put(myRow, myRow.getColumnC());
    } else {
        computedCSums.put(myRow, computedCSums.get(myRow) + myRow.getColumnC());
    }
}
Then, to get the sum of grouped Cs of any row, you just do:
computedCSums.get(mySelectedRow);
I think there are three things to consider about group by:
How rows are compared
Comparing two rows A and B on their grouping columns (C1..Cn) works like this: compare each column from C1 to Cn; if one value is less than the other, return that result; if the two values are equal, move on to the next column, and repeat until a result is returned.
Which algorithm to choose
1) Build a binary search tree or a hash table to store the tuples. When we get a tuple, search for an equal one; if it exists, merge the new tuple into the group with the same grouping values, otherwise insert it into the search structure.
2) Read some tuples, sort them, then walk the buffer and merge rows belonging to the same group (see the sketch below).
I prefer 1) rather than 2).
Memory size
If the input is huge, we must consider the memory limit.
We can use a merge algorithm to deal with this: when memory exceeds the limit, write the in-memory tuples to disk, ordered by their grouping columns; when we finish reading the input, merge the sorted runs from disk.
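As a rough illustration of option 2) above (field names and types are made up; this is only a sketch of sort-then-merge grouping, essentially what the question describes):

// Illustrative sketch of approach 2): sort by the grouping columns (A, B), then walk the
// sorted list and merge adjacent rows with equal keys by summing C. Fields are hypothetical.
import java.util.*;

record Row(long a, String b, int c) {}

class SortMergeGroupBy {
    static List<Row> groupBy(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing(Row::a).thenComparing(Row::b));

        List<Row> grouped = new ArrayList<>();
        for (Row row : sorted) {
            Row last = grouped.isEmpty() ? null : grouped.get(grouped.size() - 1);
            if (last != null && last.a() == row.a() && last.b().equals(row.b())) {
                // Same group as the previous row: merge by summing column C.
                grouped.set(grouped.size() - 1, new Row(last.a(), last.b(), last.c() + row.c()));
            } else {
                grouped.add(row);
            }
        }
        return grouped;
    }
}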

Effectively sorting objects in an array using multiple NSSortDesciptors

I have an array of dictionaries. Each dictionary holds data about an individual audio track. My app uses a star rating system so users can rate track 1-5 stars. Each dictionary has its own rating data per track, as follows:
avgRating (ex: 4.6)
rating_5_count (integer representing how many 5-star ratings a track received)
rating_4_count
rating_3_count
rating_2_count
rating_1_count
I'm trying to create a Top Charts table in my app. I'm creating a new array with objects sorted by avgRating. I understand how to sort the objects using NSSortDescriptors, but here is where I'm running into trouble...
If I only use avgRating as a sort descriptor, then if a track only receives one 5-star rating, it will jump to the top of the charts and beat out a track that might have a 4.9 with hundreds of votes.
I could set a minimum vote count to prevent this in the Top Charts array, but I would rather not do this. I would then have to change the min vote count as I get more users.
This is a bit subjective, but does anyone have any other suggestions on how to effectively sort the array?
There are many ways to deal with such a situation.
One approach could be to consider the number of votes as a measure of confidence in the rating average. Start with a baseline average of 3 (for example).
const double baseConfidenceRating = 3;
double averageRating = ...;
NSUInteger voteCount = ...;
const double baseConfidence = log10( 1000 );
double confidence = log10( 1 + voteCount ) / baseConfidence;
double confidenceWeight = fmin( confidence, 1.0 );
double confidenceRating = (1.0 - confidenceWeight) * baseConfidenceRating + confidenceWeight * averageRating;
Now sort your array based on confidenceRating instead of averageRating.
You can tweak the algorithm above by changing how many votes are needed before confidenceRating equals averageRating, and of course you can change the function I used in the example. A square root could work as well, or why not a linear progression. Your call.
This is just an example of course, a pretty dumb one. The standard deviation of the votes may add some intelligence to the algorithm, taking not only the number of votes into account but also their distribution. 100 votes that are all 5 carry more 'confidence' than 1000 votes scattered randomly between 0 and 5. Methinks.
Yes, add a method to your class that returns a weight calculated from the average and number of votes. The magic formula for the weight is up to you, obviously. Something like avgRating * log2 (2 + number_of_votes) might do it. Then use a single sort descriptor that sorts on this method.
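As a tiny illustration of that weighting idea (shown in Java only because the formula is language-agnostic; the method name and signature are made up):

// Illustrative sketch: a few votes barely lift the weight, while many votes
// let a high average dominate the chart ordering. Names are hypothetical.
static double chartWeight(double avgRating, long numberOfVotes) {
    double log2Votes = Math.log(2 + numberOfVotes) / Math.log(2); // log2(2 + number_of_votes)
    return avgRating * log2Votes;
}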

search within an array with a condition

I have two arrays I'm trying to compare at many levels. Both have the same structure with 3 "columns".
The first column contains the polygon's ID, the second an area type, and the third the percentage of each area type for a polygon.
So, for many rows, it will compare, for example: ID: 1, Type: aaa, %: 100.
But for some elements, I have many rows for the same ID. For example, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 25% --- ID 2, type ccc, 50%. And in the second array, I'll have ID 2, Type aaa, 25% --- ID 2, type bbb, 10% --- ID 2, type eee, 38% --- ID 2, type fff, 27%.
here's a visual example..
So, my function has to compare these two array and send me an email if there are differences.
(I won't show you the real code because there are 811 lines). The first "if" condition is
if array1.id = array2.id Then
if array1.type = array2.type Then
if array1.percent = array2.percent Then
zone_verification = True
Else
zone_verification = False
The problem is that there are more than 50,000 rows in each array. So when I run the function, for each "array1.id" it searches through 50,000 rows in array2. 50,000 searches over 50,000 rows: it takes very long to run!
I'm looking for something to get it running faster. How could I make my search more specific? Example: I have many id "2" rows in array1. If there are many id "2" rows in array2, find them and push all the array2 rows with that ID into a "sub array" or something like that, and search only those specific rows. Then I'll have just X rows in array1 to compare with X rows in array2, not with 50,000. And when each "id 2" in array1 is done, do the same thing for "id 4"... and for "id 5"...
Hope it's clear. it's almost the first time I use VB.net, and I have this big function to get running.
Thanks
EDIT
Here's what I wanna do.
I have two different layers in a geospatial database. Both layers have the same structure. They are a "spatial join" of the land parcels (55 000), and the land use layer. The first layer is the current one, and the second layer is the next one we'll use after 2015.
So I have, for each "land parcel", the percentage of each land use. So, for a "land parcel" (ID 7580-80-2532), I can have 50% farming use (TYPE FAR-23) and 50% residential use (RES-112). In the first array, I'll have 2 rows with the same ID (7580-80-2532), but each one will have a different type (FAR-23, RES-112) and a different %.
In the second layer, the municipal zoning (land use) has changed. So the same "land parcel" will now be 40% residential use (RES-112), 20% commercial (COM-54) and 40% a new farming use (FAR-33).
So, I wanna know if there are some differences. Some land parcels will be exactly the same. Some parcels will keep the same land use, but not the same percentage of each. But for some land parcel, there will be more or less land use types with different percentage of each.
I want this script to compare these two layers and send me an email when there are differences between these two layers for the same land parcel ID.
The script is already working, but it takes too much time.
The problem, I think, is that the script goes through all of array2 for each row in array1.
What I want is: when there is more than one row with the same ID in array1, take only that ID in both arrays.
Maybe if I order them by ID, I could write a condition, something like "once you've found what you're looking for, stop searching when you reach a different value"?
It's hard to explain clearly because I've been using VB since last week.. and English isn't my first language! ;)
If you just want to find out if there are any differences between the first and second array, you could do:
Dim diff = New HashSet(of Polygon)(array1)
diff.SymmetricExceptWith(array2)
diff will contain any Polygon which is unique to array1 or array2. If you want to do other types of comparisons, maybe you should explain what you're trying to do exactly.
UPDATE:
You could use grouping and lookups like this:
'Create lookup with first array, for fast access by ID
Dim lookupByID = array1.ToLookup(Function(p) p.id)
'Loop through each group of items with same ID in array2
For Each secondArrayValues in array2.GroupBy(Function(p) p.id)
Dim currentID As Integer = secondArrayValues.Key 'Current ID is the grouping key
'Retrieve values with same ID in array1
'Use a hashset to easily compare for equality
Dim firstArrayValues As New HashSet(of Polygon)(lookupByID(currentID))
'Check for differences between the two sets of data, for this ID
If Not firstArrayValues.SetEquals(secondArrayValues) Then
'Data has changed, do something
Console.WriteLine("Differences for ID " & currentID)
End If
Next
I am answering this question based on the first part that you wrote (that is, without the EDIT section). The correct answer should explain a good algorithm, but I am suggesting you use DB capabilities because they are heavily optimized for this kind of operation.
Put all the records into two DB tables - O(n) time. If the records are static, you don't need to perform this step every time.
Table 1
id type percent
Table 2
id type percent
Then use a DB query, something like this:
select count(*)
from table1 t1
left join table2 t2
  on t1.id = t2.id and t1.type = t2.type and t1.percent = t2.percent
where t2.id is null
(you can use better queries; the point is to let the DB perform this operation)
retrieve the result in your code and perform the necessary operation.
EDIT
1) You can sort them in O(n log n) time based on ID + type + percent and then perform a binary search.
2) Store the first array's records in a hash map with an appropriate key - it could be ID only, or ID + type.
This will take O(n) time, and a lookup, with the right key, takes constant time.
You need to define a structure to store this data. We'll store all the data in a LandParcel class, which will have a HashSet<ParcelData>
public class ParcelData
{
    public ParcelType Type { get; set; } // This can be an enum, string, etc.
    public int Percent { get; set; }

    // Redefine Equals and GetHashCode conveniently
}

public class LandParcel
{
    public ID Id { get; set; } // Whatever the type of the ID is...
    public HashSet<ParcelData> Data { get; set; }
}
Now you have to build your data structure, with something like this:
Dictionary<ID, LandParcel> data1 = new ....
foreach (var item in array1)
{
    LandParcel p;
    if (!data1.TryGetValue(item.id, out p))
        data1[item.id] = p = new LandParcel(item.id);
    // Can this data be repeated?
    p.Data.Add(new ParcelData(item.type, item.percent));
}
You do the same with a data2 dictionary for the second array. Now you iterate for all items in data1 and compare them with the item with the same id for data2.
foreach (var parcel2 in data2.Values)
{
    var parcel1 = data1[parcel2.Id]; // Beware of exceptions here !!!
    if (!parcel1.Data.SetEquals(parcel2.Data))
    {
        // You have different parcels
    }
}
(Now that I look at it, we are practically doing a small database query here, kind of smelly code ...)
Sorry for the C# code since I don't really feel so comfortable with VB, but it should be fairly straightforward.