Best practice for calculating user ratings and more - SQL

I am building an application that shares some stuff...
Each Object can be rated with 1..5 stars. For each object I keep the number of ratings per star, so I can calculate the average rating.
So per Obj I have: AvgRate and TotalRate.
I need to get the top 10 rated Obj - so I can do it using AvgRate+TotalRate (those that have these values in the top 10).
I want to have an SQL table on the server like this:
ObjId (index), totalRate, AvgRate...
Is it possible to keep this table sorted so that I can get the top 10 as the first 10 rows?
How can I query the top 10 with the calculation I want?
Also - I need to get the top 10 users. Per user I have all the Obj he shared, so I have all the ratings of these Obj - with all the info per Obj mentioned above.
I need to know how to calculate a user rating, and also how to get the top 10 quickly.
Any ideas?

Later edit: Sorry, I didn't understand your question when writing this answer; I'll leave it up for others.
What's your formula for TotalRate? And what do you mean by "so can do it using AvgRate+TotalRate"? Why are you adding an average to TotalRate - whatever that is?
Best practice is to always compute the sums/averages incrementally.
I would model Obj like this:
A - total number of rates received
B - total sum of points received
C - average (float: B/A)
D - foreign key to the user (author/owner of the Obj)
When the object receives a rating X, you then recompute A = A + 1, B = B + X, C = B/A.
In the same manner, pre-compute the aggregate sums/averages: if an Obj belongs to a user, add the same fields (A, B, C) to the User model/table, and when an Obj receives a rating X, also update the A, B, C values of user D (the owner of the Obj). Then, when selecting the top 10 users, you do not need to join with the Obj table (which may get huge); you only select users, ordered descending by the B or C column, limit 10.
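A rough SQL sketch of that idea; the table and column names here are my own assumptions, not something from the question:

-- Assumed schema: per-Obj and per-user running aggregates.
CREATE TABLE obj (
    obj_id     INT PRIMARY KEY,
    owner_id   INT NOT NULL,             -- D: the user who shared this Obj
    rate_count INT NOT NULL DEFAULT 0,   -- A: number of rates received
    rate_sum   INT NOT NULL DEFAULT 0,   -- B: total points received
    rate_avg   FLOAT NOT NULL DEFAULT 0  -- C: B / A
);

CREATE TABLE users (
    user_id    INT PRIMARY KEY,
    rate_count INT NOT NULL DEFAULT 0,
    rate_sum   INT NOT NULL DEFAULT 0,
    rate_avg   FLOAT NOT NULL DEFAULT 0
);

-- When an Obj gets a rating X (:x and :obj_id are placeholders):
-- bump the counters, then refresh the average from the new totals.
UPDATE obj SET rate_count = rate_count + 1, rate_sum = rate_sum + :x WHERE obj_id = :obj_id;
UPDATE obj SET rate_avg = rate_sum / (rate_count * 1.0) WHERE obj_id = :obj_id;

-- Apply the same two updates to the owner's row in users (owner_id of the Obj),
-- so the per-user aggregates stay in sync.

-- Top 10 users straight from the users table, no join with obj needed.
SELECT user_id, rate_avg, rate_count
FROM users
ORDER BY rate_avg DESC, rate_count DESC
LIMIT 10;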

Related

UPDATE Capacity using COUNT()

I have three tables which are joined by the following
FLIGHT F,
RESERVATION R,
AIRPLANE A
where F.AirplaneSerialNum = A.AirplaneSerialNum
and F.FlightCode = R.FlightCode
In the airplane table, there is a column to store the maximum capacity of any given plane.
In the reservation table, records of passengers are stored, and the flight they are embarking on is based on the FlightCode
In the flight table, there is a column to store the remaining capacity of any given plane, and each flight is uniquely determined by its FlightCode
Thus, I would like to find a way to update the remaining capacity by starting from the original maximum capacity and then subtracting a COUNT() of the number of times the FlightCode appears in the reservation table.
So far I've got the first half working (setting RemCapacity to the original max capacity):
UPDATE FLIGHT F
SET F.RemCapacity = (SELECT Capacity FROM airplane
WHERE AIRPLANE.airplaneserialnum = F.airplaneserialnum);
However, I'm stuck trying to subtract the number of reservations:
-- to get the count for number of times the FlightCode appears
SELECT COUNT(*) FROM reservation group by flightcode
UPDATE FLIGHT F
SET F.RemCapacity = F.RemCapacity -
(SELECT COUNT(*) FROM reservation group by flightcode ) WHERE F.FlightCode = R.FlightCode;
(returns %s invalid identifier SQL error)
And also if possible, how can I combine both halves into one query?
Not totally sure, but I think this might do the trick for you, doing all the work in one statement:
UPDATE FLIGHT F
SET F.RemCapacity = (SELECT Capacity FROM airplane
WHERE AIRPLANE.airplaneserialnum = F.airplaneserialnum) -
(SELECT COUNT(*) FROM reservation r WHERE F.FlightCode = R.FlightCode);

SQL Time Series Homework

Imagine you have these two tables.
a) streamers: it contains time series data, at a 1-min granularity, of all the channels that broadcast on
Twitch. The columns of the table are:
username: Channel username
timestamp: Epoch timestamp, in seconds, corresponding to the moment the data was captured
game: Name of the game that the user was playing at that time
viewers: Number of concurrent viewers that the user had at that time
followers: Number of total followers that the channel had at that time
b) games_metadata: it contains information of all the games that have ever been broadcasted on Twitch.
The columns of the table are:
game: Name of the game
release_date: Timestamp, in seconds, corresponding to the date when the game was released
publisher: Publisher of the game
genre: Genre of the game
Now I want the Top 10 publishers that have been watched the most during the first quarter of 2019. The output should contain publisher and hours_watched.
The problem is that I don't have the real database, so I created one and entered some values by hand.
I came up with this query, but I'm not sure it is what I want. It may be right (I don't feel like it is), but I'd like a second opinion:
SELECT publisher,
(cast(strftime('%m', "timestamp") as integer) + 2) / 3 as quarter,
COUNT((strftime('%M',`timestamp`)/(60*1.0)) * viewers) as total_hours_watch
FROM streamers AS A INNER JOIN games_metadata AS B ON A.game = B.game
WHERE quarter = 3
GROUP BY publisher,quarter
ORDER BY total_hours_watch DESC
Looks about right to me. You don't need to include quarter in the GROUP BY since the WHERE clause limits you to only one quarter. You can modify the query to get only the top 10 publishers in a couple of ways, depending on which database engine you've created it in.
For SQL Server / MS Access modify your select statement: SELECT TOP 10 publisher, ...
For MySQL add a limit clause at the end of your query: ... LIMIT 10;

Get distance and duration to closest matrix in SQL

I have logic to find the most optimised way to perform deliveries.
Let's say I have locations A, B, and C. I need the distance and duration from A to B, B to A, A to C, C to A, B to C and C to B.
I know how to come up with the above query. An example result would be NewMatrix in the fiddle:
http://sqlfiddle.com/#!6/9cce7/1
I have a table where I store the current matrix we have, based on past deliveries (AppMatrix in the fiddle above).
So I need to look up the distance and duration in this table by finding the closest matching origin and destination. I have created the following function, which works perfectly to get my answer:
SELECT TOP 1 Distance, ([Time]/60) as Duration FROM [AppMatrix]
ORDER BY ABS([OriginSiteLat] - @OriginLat) + ABS([OriginSiteLng] - @OriginLong)
        ,ABS([DestSiteLat] - @DestinationLat) + ABS([DestSiteLng] - @DestinationLong)
The problem is slowness, since I need to perform this call for each pair in the matrix (I can have 700 different deliveries in a day, and 700*700 = 490,000 pairs; this is just too slow - it takes a few hours to return the results).
I'm working on how best to limit the data, but any advice on how to optimize performance is appreciated. Maybe advice on how to use the spatial features here would help.
This is my current code:
SELECT * FROM CT as A
INNER JOIN CT AS B ON A.Latitude <> B.Latitude AND A.Longitude<>B.Longitude
CROSS APPLY [dbo].[ufn_ClosestLocation](A.Latitude,A.Longitude, B.Latitude, B.Longitude) R

Effectively sorting objects in an array using multiple NSSortDescriptors

I have an array of dictionaries. Each dictionary holds data about an individual audio track. My app uses a star rating system so users can rate track 1-5 stars. Each dictionary has its own rating data per track, as follows:
avgRating (ex: 4.6)
rating_5_count (integer representing how many 5-star ratings a track received)
rating_4_count
rating_3_count
rating_2_count
rating_1_count
I'm trying to create a Top Charts table in my app. I'm creating a new array with objects sorted by avgRating. I understand how to sort the objects using NSSortDescriptors, but here is where I'm running into trouble...
If I only use avgRating as a sort descriptor, then if a track only receives one 5-star rating, it will jump to the top of the charts and beat out a track that might have a 4.9 with hundreds of votes.
I could set a minimum vote count to prevent this in the Top Charts array, but I would rather not do this. I would then have to change the min vote count as I get more users.
This is a bit subjective, but does anyone have any other suggestions on how to effectively sort the array?
There are many ways to deal with such a situation.
One approach could be to consider the number of votes as a measure of confidence in the rating average. Start with a baseline average of 3 (for example).
const double baseConfidenceRating = 3.0;   // assumed rating when there are no votes
double averageRating = ...;                // e.g. 4.6 (the track's avgRating)
NSUInteger voteCount = ...;                // total number of ratings received

// Number of votes needed before we fully trust the average (here 1000).
const double baseConfidence = log10( 1000 );
double confidence = log10( 1 + voteCount ) / baseConfidence;
double confidenceWeight = fmin( confidence, 1.0 );

// Blend the baseline and the real average according to the confidence weight.
double confidenceRating = (1.0 - confidenceWeight) * baseConfidenceRating + confidenceWeight * averageRating;
Now sort your array based on confidenceRating instead of averageRating.
You can tweak the algorithm above by changing how many votes are needed before confidenceRating equals averageRating, and of course you can change the function I used in the example. A square root could work as well, or even a linear progression. Your call.
This is just an example of course, and a pretty naive one. The standard deviation of the votes could add some intelligence to the algorithm, taking not only the number of votes into account but also their distribution. 100 votes all at 5 carry more 'confidence' than 1000 votes scattered randomly between 0 and 5. Methinks.
Yes, add a method to your class that returns a weight calculated from the average and number of votes. The magic formula for the weight is up to you, obviously. Something like avgRating * log2 (2 + number_of_votes) might do it. Then use a single sort descriptor that sorts on this method.

How to add "weights" to a MySQL table and select random values according to these?

I want to create a table with each row containing some sort of weight. Then I want to select random values with probability equal to (weight of that row)/(weight of all rows). For example, with 5 rows having weights 1, 2, 3, 4, 5, out of 1000 draws I'd get the first row approximately 1/15 * 1000 ≈ 67 times, and so on.
The table is to be filled manually, and then I'll take random values from it. But I want the ability to control the probabilities at the filling stage.
I found this nice little algorithm in Quod Libet. You could probably translate it to some procedural SQL.
function WeightedShuffle(list of items with weights):
    max_score ← the sum of every item's weight
    choice ← random number in the range [0, max_score)
    current ← 0
    for each item (i, weight) in items:
        current ← current + weight
        if current ≥ choice or i is the last item:
            return item i
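A rough, untested sketch of how that could look in plain (non-procedural) MySQL 8+, assuming a table items(id, weight):

-- Pick one random point in [0, total weight), then take the first row
-- whose running total of weight passes that point.
SELECT t.id
FROM (
    SELECT id,
           SUM(weight) OVER (ORDER BY id) AS running_total
    FROM items
) AS t
CROSS JOIN (
    SELECT RAND() * SUM(weight) AS pick   -- single threshold, evaluated once
    FROM items
) AS r
WHERE t.running_total > r.pick
ORDER BY t.running_total
LIMIT 1;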
The easiest (and maybe best/safest?) way to do this is to add those rows to the table as many times as you want the weight to be - say I want "Tree" to be found 2x more often than "Dog" - I insert it 2 times into the table, insert "Dog" once, and just select elements at random one by one.
If the rows are complex/big, then it would be best to create a separate table (weighted_Elements or something) in which you'll just have foreign keys to the real rows, inserted as many times as the weights dictate.
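A minimal sketch of that separate weighting table, with made-up names (items is the assumed table of real rows):

CREATE TABLE weighted_elements (
    item_id INT NOT NULL,
    FOREIGN KEY (item_id) REFERENCES items(id)
);

-- "Tree" (id 1) referenced twice, "Dog" (id 2) once,
-- so Tree comes up roughly twice as often.
INSERT INTO weighted_elements (item_id) VALUES (1), (1), (2);

-- Every reference is equally likely, so the duplication does the weighting.
SELECT i.*
FROM weighted_elements w
JOIN items i ON i.id = w.item_id
ORDER BY RAND()
LIMIT 1;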
The best possible approach (if I understand your question properly) is to set up your table as you normally would and then add two extra columns, both INTs.
Column 1: Weight - This column holds your weight value, going from -X to +X, where X is the highest value you want as a weight (e.g. X=100 gives -100 to 100). This value gives the row an actual weight and increases or decreases the probability of it coming up.
Column 2: Count - This column holds the count of how many times this row has come up; it is only needed if you want to use fair weighting. Fair weighting prevents one row from always showing up (e.g. if you have one row weighted at 100 and another at 2, the row with 100 would always show up; this column lets weight 2 become more 'valuable' as you accumulate more weight-100 results). The column should be incremented by 1 each time a row is returned, but you can make the logic more advanced later so it adds the weight, etc.
Logic - It's really simple now: your query requests all rows as you normally would, then an extra select (you can change the logic here to whatever you want) takes the weight, subtracts the count, and orders by that column.
The end result should be a table where your heavier weights come up more often, up to the point where the system evenly distributes itself. If you leave out column 2, you will have a system that always returns the same weighted order unless you offset the base of the query (i.e. LIMIT [RANDOM NUMBER], [NUMBER OF ROWS TO RETURN]).
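A minimal sketch of that logic, with assumed table/column names (items, weight, hit_count):

-- Pick the row whose weight, minus how many times it has already come up, is highest.
SELECT id, weight, hit_count
FROM items
ORDER BY (weight - hit_count) DESC
LIMIT 1;

-- Then record that the chosen row came up, so it gradually loses priority.
UPDATE items
SET hit_count = hit_count + 1
WHERE id = ?;   -- the id returned above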
I'm not an expert in probability theory, but assuming you have a column called WEIGHT, how about
select FIELD_1, ... FIELD_N, (rand() * WEIGHT) as SCORE
from YOURTABLE
order by SCORE desc
limit 0, 10
This would give you 10 records, but you can change the limit clause, of course.
The problem is called Reservoir Sampling (https://en.wikipedia.org/wiki/Reservoir_sampling)
The A-Res algorithm is easy to implement in SQL:
SELECT *
FROM table
ORDER BY pow(rand(), 1 / weight) DESC
LIMIT 10;
I came looking for the answer to the same question - I decided to come up with this:
id weight
1 5
2 1
SELECT * FROM table ORDER BY RAND()/weight
It's not exact - but it is using random, so I might not expect it to be exact. I ran it 70 times and got row 2 ten times. I would have expected 1/6th but I got 1/7th. I'd say that's pretty close. I'd have to run a script a few thousand times to get a really good idea of whether it's working.