The use case is that I have a products table and a user_match_product table. For a specific user, I want to select X random products for which that user has no match.
The naive way to do that would be something like
SELECT * FROM products WHERE id NOT IN (SELECT p_id FROM user_match_product WHERE u_id = 123) ORDER BY random() LIMIT X
but that will become a performance bottleneck when having millions of rows.
I thought of some possible solutions, which I will present here. I would love to hear about your solutions to this problem or suggestions regarding mine.
Solution 1: Trust the randomness
Based on the fact that the product ids are monotonically increasing, one could optimistically generate X*C random numbers R_i (for i between 1 and X*C) in the range [min_id, max_id] and hope that a select like the following returns X elements.
SELECT * FROM products p1 WHERE p1.id IN (R_1, R_2, ..., R_XC) AND NOT EXISTS (SELECT * FROM user_match_product WHERE u_id = 123 AND p_id = p1.id) LIMIT X
Advantages
If the random number generator is good, this will probably work very well in O(1)
Old and newly added products have the same probability of being chosen
Disadvantages
If the number of matches is close to the number of products, the collision probability might be very high (a rough estimate follows).
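A rough way to size C, assuming a dense id range and that the user's M matches are spread evenly over [min_id, max_id]: each candidate survives with probability about (P - M) / P, where P is the total number of products, so X*C candidates yield about X*C * (P - M) / P unmatched products on average. Choosing C around P / (P - M), plus a safety margin, keeps the expected yield at X.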
Solution 2: Block-wise PRNG
One could create a permutation function permutate(seed, start, end, value) for the domain [start, end] that uses a seed for randomness. At time t0, user A has 0 matched products and observes that E0 products exist. The first block for user A at t0 covers the domain [1, E0]. The user remembers a counter C, which is initially 0.
To select X products, user A first has to create the permutations P_i like
P_i = permutate(seed, START, END, C + i)
The following has to hold for the function:
permutate(seed, start, end, value) is an element of [start, end]
value is an element of [start, end]
for fixed seed, start and end, the mapping value -> permutate(seed, start, end, value) is a bijection on [start, end] (this is what makes the results non-repeating)
The following query will return X non-repeating elements.
SELECT * FROM products WHERE id IN (P_1, ..., P_X)
When C reaches END, the next block is allocated by using END + 1 as the new START and the current count of products, E1, as the new END. The seed and C stay the same.
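For illustration, here is one way permutate could be implemented - a minimal sketch assuming PostgreSQL 13+ (for gcd()) and a non-negative seed; start/end are renamed start_id/end_id because END is a reserved word. An affine map is a bijection on the domain whenever the multiplier is coprime with the domain size:
CREATE OR REPLACE FUNCTION permutate(seed bigint, start_id bigint, end_id bigint, value bigint)
RETURNS bigint AS $$
DECLARE
    n bigint := end_id - start_id + 1;  -- domain size
    a bigint;                           -- multiplier, must satisfy gcd(a, n) = 1
BEGIN
    IF n = 1 THEN
        RETURN start_id;                -- degenerate one-element domain
    END IF;
    a := 1 + (seed % (n - 1));          -- seed-derived candidate multiplier (seed assumed >= 0)
    WHILE gcd(a, n) <> 1 LOOP           -- step until coprime, so the map is a bijection
        a := a + 1;
    END LOOP;
    -- affine map over [start_id, end_id]; bijective because gcd(a, n) = 1
    RETURN start_id + mod(a * (value - start_id) + seed, n);
END;
$$ LANGUAGE plpgsql;
Any other keyed bijection over the id range (e.g. a small Feistel network) would satisfy the same contract.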
Advantages
No collisions possible
Guaranteed O(1)
Disadvantages
The current block has to be finished before new products can be selected
I'd go with approach #1.
You can get a first estimate of C by counting the user's rows in user_match_product (assumed unique). If the user already possesses half of the possible products, selecting twice the number of random products seems a good heuristic.
You can also add a last-ditch correction that verifies that the number of extracted products is actually X. If it was, say, X/3, you'd need to run the same extraction two more times (avoiding already-generated random product IDs) and increase that user's C constant by a factor of three.
Also, knowing the range of product IDs, you could select random numbers in that range that do not appear in user_match_product (i.e. your first-stage query runs only against user_match_product), which is bound to have a (much?) lower cardinality than products. Then the IDs that pass the test can be safely selected from products.
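For what it's worth, a quick sketch of that two-stage idea (PostgreSQL syntax; the candidate IDs are hypothetical values generated in the application):
-- stage 1: test the random candidate ids only against user_match_product
SELECT c.id
FROM unnest(ARRAY[101, 2045, 377, 98812]) AS c(id)  -- hypothetical random candidates
WHERE NOT EXISTS (
    SELECT 1 FROM user_match_product m
    WHERE m.u_id = 123 AND m.p_id = c.id
);
-- stage 2: fetch only the surviving ids from products
-- SELECT * FROM products WHERE id IN (<survivors from stage 1>);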
If you want to choose X products that the user doesn't have, the first thing that comes to mind is to enumerate the products and use order by rand() (or the equivalent, depending on the database). This is your first solution:
SELECT p.*
FROM products p
WHERE NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id AND ump.u_id = 123)
ORDER BY random()
LIMIT X;
A simple way to make this more efficient is to choose an arbitrary subset. You can actually do this using random() as well, but in the where clause:
SELECT p.*
FROM products p
WHERE random() < Y AND
NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id AND ump.u_id = 123)
ORDER BY random()
LIMIT X;
The question is: what is "Y"? Well, let's say the number of products is P and the user has U. Then, if we choose a random set of (X + U) products, we can definitely get X products the user does not have. This suggests that the expression random() < (X + U) / P would be sufficient. Alas, the vagaries of random numbers say that sometimes we would get enough and sometimes not. Let's add a factor such as 3 to be safe. This is actually extremely safe for most values of X, U, and P.
The idea is a query such as this:
SELECT p.*
FROM Products p CROSS JOIN
(SELECT COUNT(*) as p FROM Products) v1 CROSS JOIN
(SELECT COUNT(*) as u FROM User_Match_Product WHERE u_id = 123) v2
WHERE random() < 3.0 * (v2.u + X) / v1.p AND
NOT EXISTS (SELECT 1 FROM User_Match_Product ump WHERE ump.p_id = p.id AND ump.u_id = 123)
ORDER BY random()
LIMIT X;
Note that these calculations require a small amount of time with appropriate indexes on Products and User_Match_Product.
So, suppose you have 1,000,000 products, a typical user has 20, and you want to recommend 10 more. Then the expression is (20 + 10)*3/1000000 --> 90/1000000. This query will scan the products table, pull out about 90 rows at random, then sort them and choose an appropriate 10 rows. Sorting 90 rows takes essentially constant time relative to the original operation.
For many purposes, the cost of the table scan is acceptable. It sure beats the cost of sorting all the data, for instance.
The alternative approach is to load all products for a user into the application. Then pull a random product out and compare to the list:
select p.id
from Products p cross join
(select min(id) as minid, max(id) as maxid from Products) v1
where p.id >= minid + random() * (maxid - minid)
order by p.id
limit 1;
(Note the calculation can be done outside the query so you can just plug in a constant.)
Many query optimizers will resolve this query in constant time by doing an index scan. You can then check in the application whether the user already has the product. This would then run about X times for the user, providing O(1) performance. However, it has rather bad worst-case behaviour: if there are not X available products, it will run indefinitely. Of course, additional logic can fix that problem.
Related
Background
I have a front-end with a list of items with infinite scrolling, and I fetch pages of items by specifying the page limit and offset.
Problem
Apart from simply ordering the result by some of the columns, I would like to add a "random" option. The thing is, I don't want repetitions, so I need to have the entire dataset permuted before doing the limit and offset, and I need to be able to get the same permutation as long as I supply the same seed.
What I tried
A naive approach was to write a table-valued function that takes an int seed and uses it in the ORDER BY clause like so:
SELECT *
FROM dbo.Entities e
ORDER BY HASHBYTES('MD2', e.Title) ^ @seed
OFFSET 0 ROWS
FETCH NEXT (SELECT COUNT(*) FROM dbo.Entities) ROWS ONLY
This seemed to work well at first glance, but it turned out not to be very "volatile", for lack of a better word - it becomes more visible with sparse result sets, where most seeds (chosen randomly between 0 and 2147483647) yield the same order.
I thought I would get better results by hashing the seed as well, but SQL Server doesn't let me XOR two varbinary variables. Am I even looking in the right direction? Are there any performance considerations I should be making that I might not be aware of?
The best way is to create a tally table with two columns: first a sequential integer (between 1 and 1,000,000), second a random integer. Then generate a random number to pick a starting value, and join your table via a computed ROW_NUMBER().
CREATE TABLE T_NUM (SEQUENTIAL INT, RANDOM INT);
GO
WITH
N AS(SELECT 0 AS I
UNION ALL
SELECT I + 1
FROM N
WHERE I < 9)
INSERT INTO T_NUM (SEQUENTIAL)
SELECT N1.I + N2.I * 10 + N3.I * 100 + N4.I * 1000 + N5.I * 10000 + N6.I * 100000 + 1
FROM N AS N1
CROSS JOIN N AS N2
CROSS JOIN N AS N3
CROSS JOIN N AS N4
CROSS JOIN N AS N5
CROSS JOIN N AS N6;
GO
WITH T AS
(
SELECT SEQUENTIAL, ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID())) AS ALEA
FROM T_NUM
)
UPDATE N
SET RANDOM = ALEA
FROM T_NUM AS N
JOIN T ON T.SEQUENTIAL = N.SEQUENTIAL;
GO
DECLARE @SEED INT = FLOOR(1 + RAND() * 1000000);
Now you have your seed as an entry point into the random sequence; join your table against the tally table in sequential order.
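A sketch of what that final join might look like (my interpretation of the above; it assumes Entities has an Id column to number rows by, and it only rotates the fixed permutation by the seed rather than producing a fully independent permutation per seed):
WITH E AS (
    SELECT e.*, ROW_NUMBER() OVER (ORDER BY e.Id) AS RN
    FROM dbo.Entities e
)
SELECT E.*
FROM E
JOIN T_NUM t ON t.SEQUENTIAL = E.RN   -- assumes at most 1,000,000 rows
ORDER BY (t.RANDOM + @SEED) % 1000000
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;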
ORDER BY HASHBYTES('MD2', e.Title + convert(nvarchar(max), @seed))
should work, but performance-wise it would be a disaster: you would calculate MD2 for all records every time. I would not do this on the server side at all. You can generate the random sequence on the client and then just pick rows number 158, 7, 1027 and 9 from the server. But it still has two problems:
if an item is deleted, the row numbers of all subsequent records shift. That would break the whole sequence, and you would get duplicates and missing records
ROW_NUMBER over millions of records is not that fast either
I see two options. You can query all ids from the table and use them to generate the random order, but that is a lot of numbers. Or you have to ensure the id space is dense enough; then you can query, say, 20 random ids and hope at least 10 of them exist. If you are unlucky, you have to query again.
I have some entries in my database, in my case videos with a rating, popularity and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it turns out as an integer representing the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely, I should end up with X hits on ID 1, twice as many on ID 2 and seven times as many on ID 3.
So every hit should be random, but with a probability of (boost / sum of boosts). The probability for ID 3 in this example should therefore be 0.7 (the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * (SELECT MAX(boost) FROM table)) >= boost ORDER BY rand();
Unfortunately that doesn't work. Consider the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have either only the 2nd element or both elements to choose from at random.
So 0.5 of a hit goes to the second element
And 0.5 goes to the (second and first) elements, which are then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification, or a new method, to do this with good performance.
I also thought about storing the boost field cumulatively, so I'd just do a range query over (0, sum()), but then I would have to re-index everything coming after an item whenever it changes, or develop some swapping algorithm - and that's really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "please choose a random ad with a given probability". However, I need it for another purpose; that's just to give you a final picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
calculate a random value between 0 and sum(DISTINCT boost)
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors whose running total surpasses the random value
Then, in our first example, that gives 1 with probability 0.1, 2 with 0.2 and 7 with 0.7.
now select one random entry from all entries having this boost factor
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry, I get it in 1 of 10 calls, but if there are a million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out. :( Trying to refine it.
I have to somehow include the count of entries with this boost factor, but I am stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
@sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() alone is randomness that takes no account of boost. It's useful to seed RAND, but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select the sum of boosts and generate a random number between 0 and that sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored in a variable; let's call it {random_number}.
Then select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number} (the alias has to be filtered in HAVING, since WHERE cannot reference it):
SET @cumulative_boost = 0;
SELECT
id,
@cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
table
HAVING
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you had a higher chance to win "the lottery".
Since I didn't trust any of the results I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this (the alias is vals rather than values, since VALUES is a reserved word):
SELECT vals.id, vals.boost
FROM (SELECT id, boost FROM foo) AS vals
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= vals.boost
ORDER BY RAND()
Since I don't have to run it often, I don't really care about future performance, and at the moment it was fast enough for me.
Before I used this query I checked two things:
The maximum boost value is not greater than the maximum returned by the numbers query
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= vals.boost means that a row with a boost of 2 ends up duplicated in the final result, and a row with a boost of 100 ends up in the final set 100 times. In other words: if the sum of boosts is 4212, which it was in my case, you get 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might even create a temporary table which simply holds all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= vals.boost
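A sketch of that numbers-table variant, assuming MySQL and a foo(id, boost) table as above (with boosts of at most 100); the digits cross join builds 1..100 without depending on another table's ids:
CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);
INSERT INTO numbers (id)
SELECT d1.d + 10 * d2.d + 1   -- yields every value from 1 to 100
FROM (SELECT 0 AS d UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
      UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d1
CROSS JOIN (SELECT 0 AS d UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4
            UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) AS d2;
-- each row of foo appears boost times, so one uniform pick is boost-weighted
SELECT f.id
FROM foo f
JOIN numbers n ON n.id <= f.boost
ORDER BY RAND()
LIMIT 1;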
I have a table of user profiles. Every user can have many profiles, and the user has the ability to arrange the order in which they are displayed in a grid.
There are 2 tables, Users and Profiles (1:M).
I've added an orderby column to the Users table which holds values like 1, 2, 3...
So far it seems to be okay. But when a user moves the last record to the first position, I have to go through all the records and increment their values by 1. This seems pretty ugly to me.
Is there a more convenient solution for this kind of situation?
Leave gaps in the sequence or use a decimal rather than an integer data type.
The best solution is one which mirrors the functionality, and that's a simple list of integers. Keeping the list in order takes only a few SQL statements, and it is easier to understand than the other suggestions (floats, gapped integers).
If your lists were very large (in the tens of thousands) then performance considerations might come into play, but I assume these lists aren't that long.
How about using floating-point numbers for the orderby column?
This way, you can always squeeze a profile between two others without having to change those two values.
E.g. if I want to place profile A between profiles B (ordervalue 1) and C (ordervalue 2), I can assign ordervalue 1.5 to A.
To place it on top, where the top previously had, say, ordervalue 1, you can use ordervalue 0.5.
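In SQL the move is then a single-row update; a sketch with hypothetical table/column names and ids:
-- place profile A (id 42) between B (ordervalue 1) and C (ordervalue 2)
UPDATE profiles SET ordervalue = (1 + 2) / 2.0 WHERE id = 42;
-- reading the grid stays a plain sort
SELECT * FROM profiles WHERE user_id = 7 ORDER BY ordervalue;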
There's no reason to have integers for orderby, and no reason for increments of 1 between the orders of profiles.
If the data set is small (which seems to be the case), I'd prefer to use a normal list of integers and update them in a batch when a profile gets a new position. This better reflects the application functionality.
In SQL Server, for the following table User_Profiles (user_id, profile_id, position), I'd have something like this:
-- The variables are:
-- @user_id - id of the user
-- @profile_id - id of the profile to change
-- @new_position - new position that the profile will take
-- @old_position - current position of the profile
select @old_position = position
from User_Profiles where
user_id = @user_id and profile_id = @profile_id
update p set position = pp.new_position
from User_Profiles p join (
select user_id, profile_id,
case
when position = @old_position then @new_position
when @new_position > @old_position then -- move up
case
when @old_position < position and
position <= @new_position
then position - 1
else position
end
when @new_position < @old_position then -- move down
case
when position < @old_position and
@new_position <= position
then position + 1
else position
end
else position -- the same
end as new_position
from User_Profiles p where user_id = @user_id
) as pp on
p.user_id = pp.user_id and p.profile_id = pp.profile_id
As a user adds profiles, set each new profile's ordering number to the previous one +1000000. e.g. to start off with:
p1 1000000
p2 2000000
p3 3000000
When reordering, set the profile's order to the middle of the two it is going in between:
p1 1000000
p2 2000000
p3 1500000
This gives the order p1,p3,p2
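If repeated reordering ever closes a gap completely, one batch statement can re-spread the list; a T-SQL sketch with hypothetical names:
WITH ordered AS (
    SELECT orderby,
           ROW_NUMBER() OVER (ORDER BY orderby) AS rn
    FROM profiles
    WHERE user_id = 7            -- hypothetical user
)
UPDATE ordered SET orderby = rn * 1000000;  -- restore the 1000000-wide gaps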
Instead of keeping the order in the orderby column, you could introduce a linked-list concept to your design: add a column like nextId that contains the id of the next profile in the chain.
When you query the profiles table, you can then sort the profiles in your code (Java, C#, etc.).
I think the idea of leaving gaps between the orders is interesting, but I don't know if it is a "more convenient" solution to your problem.
I think you would be better off just updating your orderby column, because you still have to determine which rows the profiles have moved between, and decide what to do when two profiles swap positions (do you calculate the new orderby value for the first one and then the second one?). And what happens if the gap between them isn't large enough?
It shouldn't be that data-intensive to simply walk down the order the user chose and update each record accordingly.
I am trying to wrap my head around this one this morning.
I am trying to show inventory status for parts (for our products) and this query only becomes complex if I try to return all parts.
Let me lay it out:
single table inventoryReport
I have a distinct list of X parts I wish to display; the result must be X rows (1 row per part, showing the latest inventory entry).
The table is made up of dated entries of inventory changes (so I only need the LATEST dated entry per part).
All data is contained in this single table, so no joins are necessary.
Currently for 1 single part, it is fairly simple and I can accomplish this by doing the following sql (to give you some idea):
SELECT TOP (1) ldDate, ptProdLine, inPart, inSite, inAbc, ptUm, inQtyOh + inQtyNonet AS in_qty_oh, inQtyAvail, inQtyNonet, ldCustConsignQty, inSuppConsignQty
FROM inventoryReport
WHERE (ldPart = 'ABC123')
ORDER BY ldDate DESC
that gets me my TOP 1 row; simple per part. However, I need to show all X (let's say 30) parts, so I need 30 rows with that result. Of course, the simple solution would be to loop X SQL calls in my code, but that would be costly; for this purpose I would love to work this SQL some more and reduce the X calls to the db down to just 1 query.
From what I can see here I need to keep track of the latest date per item somehow while looking for my result set.
I would ultimately do a
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
to limit the parts I need. Hopefully I made my question clear enough. Let me know if you have an idea. I cannot do a DISTINCT because the rows are not the same: the date needs to be the latest, and I need a maximum of X rows.
Thoughts? I'm stuck...
SELECT *
FROM (SELECT i.*,
ROW_NUMBER() OVER(PARTITION BY ldPart ORDER BY ldDate DESC) r
FROM inventoryReport i
WHERE ldPart in ('ABC123', 'BFD21', 'AA123', etc)
) t
WHERE t.r = 1
EDIT: Be sure to test the performance of each solution. As pointed out in this question, the CTE method may outperform using ROW_NUMBER.
;with cteMaxDate as (
select ldPart, max(ldDate) as MaxDate
from inventoryReport
group by ldPart
)
SELECT md.MaxDate, ir.ptProdLine, ir.inPart, ir.inSite, ir.inAbc, ir.ptUm, ir.inQtyOh + ir.inQtyNonet AS in_qty_oh, ir.inQtyAvail, ir.inQtyNonet, ir.ldCustConsignQty, ir.inSuppConsignQty
FROM cteMaxDate md
INNER JOIN inventoryReport ir
on md.ldPart = ir.ldPart
and md.MaxDate = ir.ldDate
You need to join to a subquery:
SELECT i.ldPart, x.LastDate, i.inAbc
FROM inventoryReport i
INNER JOIN (Select ldPart, Max(ldDate) As LastDate FROM inventoryReport GROUP BY ldPart) x
on i.ldPart = x.ldPart and i.ldDate = x.LastDate
I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null,
PRIMARY KEY (id)
);
I have four questions:
1. How may I write a query that returns a result set giving me user_id, weight_gain, where weight_gain is the difference between a weight and the weight that was recorded 7 days ago?
2. How may I write a query that returns the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
the date arithmetic syntax is probably wrong but you get the idea
How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
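Put together, with MySQL's date arithmetic, that might look like this (a sketch; N = 10 here):
SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev
  ON curr.user_id = prev.user_id
 AND curr.created_at = CURRENT_DATE
 AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY weight_gain DESC
LIMIT 10;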
for the last two questions: don't speculate, examine execution plans (PostgreSQL has EXPLAIN ANALYZE; MySQL has EXPLAIN). You'll probably find you need to index the columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If MySQL supports calculated columns in a table and allows indexing those columns, then that might help.
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when a user supplies their weight at a new weigh-in, the record written to the foodbar table holds the difference from the previously recorded weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you can use SUM, because it's possible for someone to have weighings every day - using "just somebody"'s equation of curr.weight - prev.weight wouldn't work, regardless of the time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
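For instance, with the delta column the weekly top N collapses to one aggregate query (a sketch in MySQL syntax; N = 10):
SELECT user_id, SUM(weight_delta) AS gain
FROM foodbar
WHERE created_at > CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY gain DESC
LIMIT 10;   -- top N = 10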
It's not obvious, but some important information is missing from the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules for determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days old is taken as the weight x days ago. (Even though, for example, a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining the weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
3&4: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing for index selection is which columns you filter by or join on. The optimiser will use an index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seeks of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. And that is a complex calculation based on a number of queries and a lot of preceding processing, so Weight will provide zero benefit as an index.
Another note: even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table (i.e. a table scan will be used to obtain the bulk of the data).
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on (User_id, Created_at) would be useful (more so if it is the clustered index).
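For example, against the original table with its synthetic key (index name hypothetical):
CREATE INDEX ix_foodbar_user_created ON foodbar (user_id, created_at);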
4: No; unfortunately the individual values of H and W cannot, by themselves, determine the ordering of the product. E.g. compare H=3, W=3 with H=5, W=1: the first pair has the smaller H, yet its product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation in its own column and index that column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.